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Piecewise  linear  classifiers  constitute  a  group  of  classifiers 
often  used  in  real  time  pattern  recognition.  In  most  cases 
they  are  reliable  and  sufficiently  fast.  In  this  work,  new 
algorithms  for  generating  piecewise  linear  classifiers  are 
developed,  which  can  easily  handle  multi  class  problems.  Using 
only  a  small  number  of  discriminant  functions,  they  attempt  to 
create  a  classifier  with  low  error  rate.  In  brief,  the  concept 
consists  of  first  splitting  the  sample  space  of  a  given  class 
into  two  subsample  spaces.  The  samples  belonging  to  a  given 
class  which  are  lying  in  the  same  (sub) sample  space  are  said 
to  belong  to  the  same  subclass.  Next,  the  (sub) classes  are 
separated'  using  linear  classifiers.  Then,  a  new  (sub)class  is 
split  and  a  classifier  based  on  this  splitting,  is  created. 
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ABSTRACT  (continued) 

This  process  continues  until  the  overall  performance  is  not 
improved  by  further  splitting.  Our  classifiers  have  been 
compared  with  other  relevant  classifiers  such  as  Bayes  classi¬ 
fier  (for  known  distributions),  Bayes  classifier  with 
estimated  densities,  the  nearest  neighbour  rule  as  well  as 
previously  developed  piecewise  linear  classifiers.  So  far 
tests  have  shown  that  our  algorithms  are  working  veiy  weii. 
They  produce  fast  and  reliable  classifiers  which  in  some  cases 
have  been  found  superior  to  the  classifiers  used  for  the 
comparision.  -  ,  s 
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ON  ALGORITHMS  FOR  GENERATING  COMPUTATIONALLY 
SIMPLE  PIECEWISE  LINEAR  CLASSIFIERS 


1  INTRODUCTION 

In  many  applications  one  has  to  assign  a  given  object  to  one  out  of  several 
possible  categories  (classes).  Theories  for  making  such  decisions,  also  called 
classification-  or  pattern  recognition-theory,  can  be  dated  back  to  Fisher’s 
classical  work  in  1936  [12].  However,  it  was  not  until  the  development  of  the 
electronic  computer  some  35  years  ago  that  considerable  advances  were  made. 
The  range  of  pattern  recognition  applications  is  now  very  broad.  A  few 
examples  are: 


-  Recognition  of  moving  objects  (vehicles,  human  beings  etc)  in  TV-images, 
for  automatic  surveillance. 

-  Classification  of  remotely  sensed  data,  i.e.  assigning  each  picture  element 
of  a  satellite  image  to  one  out  of  a  given  set  of  classes  (acres,  forests, 
roads,  etc). 

-  Medical  diagnostics,  such  as  analysis  of  X-ray  images  (e.g.  detection  of 
cancer),  predicting  relapse  in  pulmonary  tuberculosis  suffers. 

-  Waveform  classification,  e.g.  speech  recognition,  seismic  analysis  (i.e. 
discrimination  between  earthquakes  and  nuclear  explosions),  target 
recognition  from  radar  echoes  and  electroencephalogram  analysis. 


In  statistical  pattern  recognition  each  object  is  being  represented  by  a  so  called 
feature  vector.  Each  component  of  the  vector  is  a  measured  value  of  some 
feature  of  the  object.  The  feature  vector  is  input  to  the  classifier  which  in  turn 
assigns  the  object  to  one  of  the  classes.  Usually  discriminant  functions  are  used 
for  classification  purposes.  These  are  a  set  of  confidence  functions,  one  function 
for  each  class.  The  function  value  increases  with  the  confidence,  and  an  object 
is  assigned  to  the  class  with  the  maximum  discriminant  value. 
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Many  of  the  problems  we  are  concerned  with,  are  to  be  used  in  a  real  time 
environment  only.  In  real  time  applications  fast  decision  making  is  a  must. 
Hence  only  computationally  simple  discriminant  functions  such  as  those  with 
linear,  quadratic  or  piecewise  linear  discriminant  functions  may  be  used. 

The  linear  classifiers  are  easy  to  implement  and  very  fast  to  compute.  Their 
error  rate  may,  however,  be  quite  high  unless  the  classes  are  (almost)  linearly 
separable.  The  quadratic  classifiers  are  usually  more  reliable  than  the  linear 
ones.  Moreover  they  are  almost  as  fast  (assuming  a  “small”  number  of  features 
to  be  used).  Unfortunately  the  linear  and  the  quadratic  classifiers  are  not 
generally  capable  of  handling  data  sets  with  multiple  modes  and/or  concave 
data  sets  satisfactorily.  Opposed  to  those  classifiers,  the  piecewise  linear 
classifiers  manage  to  give  reliable  results  in  such  situations  as  well.  However, 
traditional  piecewise  linear  discriminant  functions  may  be  quite  complex  and 
computationally  heavy  if  classes  are  partly  overlapping  in  feature  space. 

In  this  thesis  new  algorithms  for  generating  a  piecewise  linear  classifier  are 
developed.  The  goal  is  to  create  a  classifier  which  combines  the  linear  classifier’s 
speed  with  high  reliability.  Roughly  speaking,  the  strategy  is  as  follows:  The 
sample  space  of  each  class  is  split  into  several  disjoint  subsample  spaces.  The 
samples  included  in  the  same  subsample  space  are  said  to  belong  to  the  same 
subclass.  Linear  discriminant  functions  separating  the  subclasses  are  then 
constructed  and  employed  to  discriminate  between  the  classes.  To  apply  this 
method,  several  difficulties  must  be  solved.  Among  the  most  important 
problems  are 


-  how  to  split  a  given  (sub)sample  space, 

-  how  to  terminate  the  splitting, 

-  how  to  generate  linear  discriminant  functions  for  separating  the 
(sub)classes. 


In  the  next  chapter  a  brief  introduction  to  pattern  recognition  is  given. 

Chapter  3  describes  the  new  methods,  while  in  chapter  1  the  new  algorithms  are 


developed  and  evaluated.  In  chapter  5  their  performance  are  compared  with 
that  of  other  classifiers.  Chapter  6  contains  the  summary  and  conclusion. 


2  INTRODUCTION  TO  PATTERN  RECOGNITION 
2.1  Basic  concepts 

In  statistical  pattern  recognition  each  object  is  represented  by  a  feature  vector 
x*  =  [xi,X25  •  •  • ,  x<*|  containing  a  measurement  of  predefined  object  features. 
Which  features  to  use  and  how  many  depends  on  the  application.  We  will, 
however,  always  search  for  features  maximizing  the  separation  between  the 
classes.  Hopefully,  feature  vectors  from  objects  belonging  to  different  classes 
will  belong  to  separable  regions  in  feature  space. 

The  classification  is  usually  done  by  use  of  so-called  discriminant  functions. 
This  is  a  set  of  functions  </,(x),  one  function  for  each  class  u>,  ,  i  =  1, 2,  •  •  • ,  c. 
An  object  with  feature  vector  x  is  assigned  to  the  class  with  maximum 
discriminant  value.  That  is,  the  object  is  assigned  to  u>k  if 


9> t(x)  =  rnax{^(x)}  . 


(2.1) 


This  decision  rule  corresponds  to  dividing  the  feature  space  into  disjoint  regions 
Cli  ,  i  =  1, 2,  •  •  • ,  c  and  assigning  an  object  with  feature  vector  x  to  u>k  if 
x  €  fife.  A  hypersurface  which  separates  two  regions,  say  and  f lj,  is  a 
hypersurface  containing  vectors  x  where 


$.(x)  =  Sfj(x)  and 

SN(x)  >  S*(x)  k^i,j. 


(2.2) 


Figure  2.1  shows  a  3-class,  two  dimensional  feature  space  as  an  example. 


*1 

Figure  2.1:  Two  dimensional  feature  space,  3  classes. 

A  central  problem  in  the  classification  theory  is  how  to  design  efficient 
discriminant  functions.  Two  main  approaches  are  available;  supervised  and 
unsupervised  learning.  In  this  thesis  we  will  consider  the  supervised  learning 
approach  only. 

The  most  usual  way  of  designing  and  evaluating  discriminant  functions  is  shown 
in  figure  2.2  and  is  described  in  the  following.  In  supervised  learning  we  have 
available  a  data  set  where  the  true  class  of  each  feature  vector  (sample)  is  known 
in  advance.  Usually  the  data  set  is  first  divided  into  two  subsets,  one  design  set 
and  one  test  set.  The  design  set  is  then  used  for  generating  the  classifier 
(discriminant  functions).  The  test  set  is  used  for  the  evaluation.  The  reason  for 
using  an  independent  test  set  is  that  a  too  optimistic  performance  is  achieved  if 
design  and  testing  are  based  on  the  same  data.  Other  approaches  for  evaluating 
classifiers  (error  rate  estimation)  are  also  available  (e.g.  Hand  [16,  17]). 

The  image  processing  and  pattern  recognition  group  at  the  Norwegian  Defence 
Research  Establishment  (NDRE)  is  mostly  concerned  with  real  time 
applications  of  image  analysis/pattern  recognition.  Typically,  25  TV-images  are 
to  be  processed  each  second.  Thus  a  very  fast  decision  making  (classification) 
algorithm  is  needed.  However,  the  time  required  for  designing  (training)  the 


Figure  2.2:  Illustration  of  design  and  evaluation  of  a  classifier  using  the  supervised 
learning  approach. 

classifier  is  of  minor  importance  for  us. 


2.2  Linear  classifiers 

The  discriminant  functions  of  linear  classifiers  are  of  the  form 


d 

9i(x)  =  YlxJa'J  +  a«o  (2-3) 

;=1 


where  a,;  an®  real  coefficients.  By  defining  the  augmented  feature  vector 


and  the  weight  vector 
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a,  = 


'aM\ 

a,  1 


V  ) 


the  discriminant  function  in  2.3  may  be  expressed  by  the  inner  product  between 
a,  and  the  augmented  feature  vector  y: 


g,{x)  =  ajy  . 


It  is  easy  to  verify  that  the  hypersurface  separating  two  classes  is  a  hyperplane 
perpendicular  to  the  vector 


a  —  a,  -  a_,  . 


(2.4) 


Advantages  of  the  linear  classifier  are  the  computational  speed  and  ease  of 
implementation.  However,  the  error  rate  may  be  high  if  the  classes  are  not 
(alrr-'st)  linearly  separable. 


2.3  Quadratic  classifiers 

The  discriminant  functions  of  quadratic  classifiers  are  given  by 


<7«(x)  =  x‘  W,  X  +  wj  X  +  w. 


(2.5) 


where  W,  is  a  d  x  d-matrix,  w,  is  a  d-dimensional  vector  and  wt  is  a  scalar.  It. 
may  be  shown  that  the  decision  surfaces  between  two  classes  are  hyperquadratic, 
and  can  assume  several  forms:  Pairs  of  hyperplanes,  hyperspheres, 
hyperellipsoides  and  hyperparaboloids.  Quadratic  classifiers  usually  give  a  lower 
error  rate  than  linear  classifiers,  and  they  are  almost  as  fast  provided  that  d  is 
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“small”  [28].  In  fact,  one  may  easily  show  that  ^—1  multiplications  (and 
additions)  are  required  for  computing  the  discriminant  functions  for  each  class. 


In  figure  2.3  a  two  class  two-dimensional  example  is  shown.  300  samples  are 
generated  from  two  bivariate  Gaussian  distributions.  150  of  them  axe  generated 
from  a  distribution  with  mean  vector  [3,0]‘,  unit  variance  and  150  samples  from 


aJV  0, 


1  0 
0  9 


distribution.  The  linear  decision  boundary  is  found  by  using 


the  mean  squared  error  method,  and  the  quadratic  surface  is  found  by  using  the 
maximum  likelihood  method  assuming  Gaussian  distributions.  For  details  see 
Duda  and  Hart  [8].  From  the  figure  we  can  see  that  6.0%  of  the  samples  in  the 


Figure  2.3:  Examples  of  decisions  boundaries  when  using  linear  and  quadratic 
classifiers. 

design  set  are  misclassified  if  we’re  using  the  linear  classifier  and  4.7%  if  the 
quadratic  one  is  used.  Their  error  rates  axe  6.7%  and  5.0%  respectively. 


Unfortunately,  linear  and  quadratic  classifiers  will  in  general  not  be  able  to 
handle  data  sets  containing  multiple  modes  and/or  concavities  satisfactorily.  An 
example  is  given  in  figure  2.4  where  a  linear  classifier  is  used  to  discriminate 
between  two  “banana”-distributed  classes.  Although  the  classes  form  almost 
non-overlapping  clusters,  the  error  rate  is  quite  high.  The  same  will  be  true  for 
the  quadratic  classifier. 
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Figure  2.4:  An  example  where  both  the  linear  and  the  quadratic  classifiers  fail. 
2.4  Piecewise  linear  classifiers 

A  piecewise  linear  classifier,  abbreviated  PLC,  may  be  defined  as  a  classifier 
where  the  hypersurface  separating  two  classes  consists  of  hyperplanes.  In  other 
words,  the  hypersurface  is  piecewise  linear.  As  an  example,  a  piecewise  linear 
classifier  which  discriminates  well  between  the  two  banana  distributions  used  in 
figure  2.4  is  shown  in  figure  2.5. 

Many  algorithms  for  constructing  piecewise  linear  classifiers  have  been 
suggested.  The  most  important  ones  fail  into  one  of  the  two  following  groups: 


a)  Iterative  methods  based  on  a  given  (fixed)  number  of  discriminant 
functions  (number  of  weight  vectors). 

b)  Methods  which  increase  the  number  of  discriminant  functions  until  the 
design  set  is  correctly  classified. 

The  group  a)  -  algorithms  require  the  number  of  weight  vectors  to  be  given  in 
advance.  This  is  generally  a  disadvantage  because  it  often  has  to  be  chosen  by 
trial  and  error.  On  the  other  hand,  the  group  b)  -  classifiers  may  be  very 


Figure  2.5:  Example  of  a  piecewise  linear  classifier  with  low  error  rate. 

complicated.  This  is  especially  true  if  the  different  classes  in  the  design  set  are 
not  well  separated.  In  this  case  the  resulting  classifier  might  turn  out  to  be  too 
complex  and  time  consuming  for  many  applications.  We  will  now  briefly 
describe  the  most  important  classifiers  in  the  two  groups. 

Kesler  constructed  a  c-class  linear  classifier  [8]  which  in  turn  was  modified  to  a 
c-class  piecewise  linear  classifier  by  Duda  and  Fossum  [7].  Instead  of  using  only 
one  weight  vector  for  representing  each  class  (as  Kesler  did),  several  weight 
vectors  were  applied.  Assume  we  have  determined  n,  linear  discriminant 
functions  for  c (weight  vectors  a,-, , a,,,  •  •  • ,  a,-  ).  Then  an  object  with  feature 
vector  x  is  assigned  to  u>k  if 

^(x)  =  max{#(x)} 

l 

=  max  |  max  R,y}}  (2-6> 

where  y  denotes  the  corresponding  augmented  feature  vector. 


The  problem  is  how  to  train  the  classifier.  Kesler  modified  the  fixed  increment 
rule  for  linear  classifiers.  This  training  process  is  an  iterative  procedure,  making 
the  classifier  more  sensitive  to  the  misclassified  samples. 
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The  classifier  of  Duda  and  Fossum  is  based  on  2.6,  and  they  modified  Keslers 
rule  further  so  several  weight  vectors  could  be  trained  for  representing  each 
class.  Assume  for  a  given  augmented  feature  vector  y  from  u;, 

a^y  =  max{a|ky}  ,  k  =  1,2,- •  •  ,n< . 

Now,  a  correction  is  made  if  and  only  if  p,(x)  —  pt(x)  ,  A:  7^  i,  is  less  than  a 
margin  M.  In  other  words 


aj,y  -M  <  a‘fc|y 


k  ^  i 
M  >  0 


(2.7) 


and  the  following  new  weight  vectors  are  generated  in  the  m’th  correction, 
m  >  1 


ap,(m) 


a i,(m-  1)  +  wmy 
'  ajt,(m  -  1)  -  wmy 
,  ap,(m-l) 


,  p  =  i,  q  =  j 

,  p  =  k,  q  =  l 
,  otherwise 


(2.8) 


where 

0  <  wmin  <wm<  wmax  . 

The  constants  wmin ,  wmax,  and  the  margin  M  have  to  be  found  by  trial  and 
error,  and  the  way  of  generating  wm  has  to  be  known  in  advance. 

Chang  [3]  presented  an  algorithm  for  the  two-class  problem  where  the  piecewise 
linear  discriminant  function  p(x,  ai,  a2,  •  •  • ,  a„)  is  represented  in  terms  of  the  so 
called  maximum  and  minimum  functions.  For  example, 


p(x,  aj,a2,  •  •  • ,  an)  =  max  {min  {ajy,  •  •  • , a^y} ,  •  •  • ,  min  {a^+1y,  ■  •  • ,  aj,y}}  (2.9 


is  a  piecewise  linear  function,  and  y  is  still  the  augmented  vector  of  x.  We  also 
notice  that  2.9  is  a  generalization  of  2.6  (for  the  two-class  problem).  Moreover, 
a  sample  x  is  assigned  to  if  p(x,  ax,  a2,  •  •  • ,  a*)  >0,  otherwise  it  is  assigned  to 
u>2 ■  The  training  procedure  is  iterative,  and  the  weight  vectors  are  trained  one 
at  a  time.  In  iteration  i  +  1,  the  weight  vector  a &(:  +  1),  k  =  1, 2,  •  •  • ,  n  is  found 
by  using  a  linear  separation  procedure  on  the  data  set  Ak(i)-  This  is  a  subset  of 
the  design  set  containing  all  samples  for  which  p(x,ax,  •  •  ■ ,  a„)  =  a|.(z)y.  This 
algorithm  will  hopefully  find  a  suitable  solution  within  a  prespecified  number  of 
iterations. 

Unfortunately  the  combination  of  max  and  min  functions  of  the  piecewise  linear 
function  has  to  be  determined  by  trial  and  error.  This  may  be  a  very  difficult 
task. 

Takiyama  has  made  a  committee  machine  for  the  two-class  problem  [34].  The 
committee  machine  is  illustrated  in  figure  2.6.  The  decision  logic  and  the 


Figure  2.6:  Diagram  of  the  committee  machine 

number  of  weight  vectors  has  to  be  known  in  advance.  The  best  known  logics 
are  probably  the  majority  logic  and  the  veto  logic.  The  majority  logic  assigns  a 
sample  x  to  u;x  if 
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,(x)>^p-  (2.10) 

i=i  z 

and  the  veto  logic  assigns  a  sample  to  if 

f>.(x)  =  n.  (2.11) 

i=i 

The  problem  is  to  determine  the  weight  vectors  ai,  •  •  • ,  an.  Takiyama  [34]  has 
proposed  an  algorithm  which  try  to  optimize  a  modified  perceptron  criterion 
function  with  a  gradient  descent  procedure. 

A  general  disadvantage  with  the  group  a)  -  approaches  is,  as  already 
mentioned,  that  the  number  of  weight  vectors  has  to  be  determined  in  advance. 
Furthermore,  in  Chang’s  and  Takiyama’s  proposals,  only  the  two-class  problem 
has  been  considered. 

Mangasarian  [25],  Mizouguchi  et  al  [26]  and  Lee  and  Richards  [24]  have 
developed  group  b)  -  algorithms  for  the  two-class  problem.  We  will  briefly 
describe  their  methods  in  the  following. 

Mangasarian  [25]  proposed  an  algorithm  for  using  piecewise  linear  decision 
boundaries.  The  method  is  illustrated  in  figure  2.7  and  figure  2.8.  Linear 
programming  minimizing  the  perceptron  criterion  function  is  used  for  dividing 
the  feature  space  into  three  regions  by  two  parallel  hyperplanes.  The  region  to 
the  positive  side  of  both  hyperplanes,  {y  :  a^y  >  &},  will  contain  only  samples 
from  u>i.  The  region  to  the  negative  side  of  both  hyperplanes,  {y  :  a[y  <  ai}, 
(we  assume  a,  <  $  Vi),  contains  samples  from  w2  only.  The  “confusion”  region 
between  the  two  hyperplanes  may  contain  samples  from  both  classes.  The 
algorithm  proceeds  by  applying  the  same  procedure  to  the  samples  in  the 
confusion  region.  This  results  in  a  confusion  region  within  the  confusion  region, 
and  so  forth.  The  process  terminates  when  the  confusion  region  is  empty.  In 
other  words,  the  termination  takes  place  when  all  samples  in  the  design  set  are 
correctly  classified.  Alternatively  one  may  simply  terminate  the  algorithm  after 
a  predetermined  number  of  iterations,  leaving  a  region  with  unclassified 
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samples.  The  decision  boundary  of  the  algorithm  is  illustrated  in  figure  2.8  for 
the  situation  where  separation  between  the  classes  is  obtained  by  using  two 
weight  vectors. 

The  piecewise  linear  classifier  of  Mizoguchi  et  al  [26]  is  based  on  combining 
linear  discriminant  functions  in  a  tree  design.  First  the  feature  space  is  divided 
into  two  regions  by  using  a  linear  discriminant  function.  Next,  if  the  classes  are 
not  linearly  separable,  each  of  the  two  regions  is  divided  into  two  subregions 
with  linear  discriminant  functions.  This  process  is  repeated  until  all  samples 
within  a  subregion  belong  to  the  same  class.  An  example  of  a  piecewise  linear 
classifier  constructed  in  this  way  is  given  in  figure  2.9. 


Figure  2.9:  Example  of  a  piecewise  linear  classifier  constructed  with  the  algorithm 
of  Mizoguchi  et  al.  The  corresponding  tree  structure  is  also  shown. 

Lee  and  Richards  [24]  have  developed  a  piecewise  linear  classifier  based  on  the 
concept  of  single  sided  decision  hyperplane.  This  is  a  hyperplane  Hy 

a‘y  =  0  which  has  samples  of  one  class  only  on  the  positive  side  of  the  plane 
(a‘y  >  0).  On  the  negative  side,  samples  from  several  classes  may  be  present. 
Initially,  a  single  sided  decision  hyperplane  perpendicular  to  ai  is  computed. 
Then,  the  class  of  the  samples  on  the  positive  side  of  the  hyperplane  is 
identified.  Now,  if  samples  from  both  classes  are  located  on  the  negative  side  of 


the  hyperplane,  then  a  new  single  sided  decision  hyperplane  is  created  in  this 
sub  space.  This  process  continues  until  samples  from  one  class  only  are  found  on 
the  negative  side  of  the  last  constructed  hyperplane.  When  the  algorithm  has 
terminated,  we  know  that  the  samples  in  the  design  set  is  correctly  classified. 

This  classifier  may  also  be  considered  as  a  committee  machine  using  seniority 
logic.  In  majority  and  veto  logics,  all  the  committee  members  have  the  same 
influence.  This  is  not  true  for  the  seniority  logic  where  some  members  are  more 
influential  than  others.  The  influence  and  the  number  of  committee  members 
are  determined  during  the  training  process. 

The  algorithm  is  illustrated  in  figure  2.10.  A,(x)  is  defined  in  figure  2.6.  We 


Figure  2.10:  Examples  of  classification  with  Lee  and  Richards  algorithm. 

may  also  compute  all  /i;(x)  and  then  use  a  logical  network  for  the  classification. 
The  network’s  truth  table  of  the  classifier  in  figure  2.10  is  given  in  table  2.1.  X 
denotes  a  “don’t  care”  state  of  a  committee  member. 


Hoffman  and  Moe  [19]  have  developed  an  interesting  two-class  algorithm  which 
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Table  2.1:  Truth  table  for  the  decision  logic  for  the  classifier  used  in  figure  2.10. 

does  not  fit  into  any  of  the  two  above  mentioned  groups  of  algorithms.  They 
constructed  a  linear  discriminant  function  and  created  a  confusion  region 
(a  <  a'y  <  (3)  in  the  same  way  as  Mangasarian  [25].  But,  opposed  to  him,  they 
clustered  the  samples  from  each  class  in  the  confusion  region  into  several 
clusters  according  to  the  following  procedure.  First  the  samples  (from  each 
class)  are  divided  into  two  clusters.  Next,  the  clusters,  which  are  sufficiently 
“large”  (with  respect  to  the  within-class  scatter  matrix),  are  divided  further. 
This  process  continues  until  all  clusters  are  “equal”  and  sufficiently  “small”  (also 
with  respect  to  the  within-class  scatter  matrix).  Finally,  they  computed  the 
sample  mean  of  each  cluster.  The  classifier  assigns  a  sample  y  to  u>2  if  a‘y  <  a, 
to  u?]  if  a‘y  >  (3  and  to  the  class  with  nearest  cluster  mean  if  a  <  a‘y  <  /?. 

As  we  have  already  mentioned,  all  algorithms- we  have  described  except  the 
classifier  of  Duda  and  Fossum  [7]  are  developed  for  the  two-class  problem  only. 
If  we  want  to  classify  an  object  to  one  out  of  c  classes,  c  >  2,  this  may  be  done 
by  creating  two  main  class  groups  where  several  classes  are  merged  together. 
While  classifying,  a  sample  is  classified  through  different  class  groups  until  it  is 
assigned  to  one  of  the  c  classes.  Unfortunately  this  process  makes  the  classifier 
much  more  complicated  for  the  c-class  problem  than  e.g.  the  quadratic 
classifier.  An  example  where  c  =  4  is  shown  in  figure  2.11.  and  u>2  is  merged 
together  to  one  class  group,  w3  and  to  another.  After  first  being  assigned  to 
either  the  class  group  U  u>2  or  the  class  group  w3  U  u>4,  a  sample  y  can  be 
assigned  to  one  of  the  4  classes. 

If  samples  from  the  different  classes  are  partly  overlapping  each  other,  no  simple 
and  reliable  classifier  is  available.  It  is  difficult  to  have  any  general  knowledge 
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Figure  2.11:  Illustration  of  how  to  use  a  two-class  classifier  for  the  c-class  prob¬ 
lem. 

about  the  performance  of  the  group  a)-algorithms.  Furthermore,  one  may  easily 
be  convinced  that  the  group  b)-algorithms  tend  to  be  quite  complicated.  This 
is  of  course  unconvenient  from  a  computational  point  of  view.  It  is  also 
important  to  remember  that  a  correctly  classified  design  set  does  not  in  general 
imply  low  error  rate  and  hence  a  reliable  classifier. 

Therefore,  one  objective  is  to  develope  a  classifier  which  attempts  to  fulfill  the 
following  requirements: 


-  It  must  be  able  to  handle  the  c-class  problem  easily. 

-  A  low  error  rate  should  be  achieved  with  a  relatively  small  number  of 
(linear)  discriminant  functions  rather  than  resulting  in  a  correctly 
classified  design  set. 


In  the  next  chapter  we  will  describe  a  method  which  attempts  to  accomplish 
these  requirements. 
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3  THE  NEW  CONCEPT 

In  this  chapter  an  introduction  to  a  new  concept  for  constructing  a  piecewise 
linear  classifier  is  presented  (for  a  brief  presentation  of  the  method,  see  also 
[29]).  First  the  sample  space  of  each  class  may  be  partitioned  (split)  into  several 
(>  1)  subsample  spaces.  The  samples  from  a  given  class  laying  in  the  same 
subsample  space  are  said  to  belong  to  the  same  subclass.  The  partitioning  is 
done  in  order  to  make  the  classifier  more  adaptable  to  the  classes.  Next,  the 
generated  subclass  information  is  used  for  constructing  the  piecewise  linear 
discriminant  function.  Each  subclass  is  represented  by  one  linear  discriminant 
function  (one  weight  vector).  However,  since  the  distributions  of  the  classes 
usually  are  unknown,  estimation  techniques  have  to  be  used.  As  an  example  the 
splitting  process  becomes  a  clustering  process  which  divides  the  samples  in  a 
given  (sub)class  into  clusters. 

The  concept  is  illustrated  by  the  following  “quasi-Pascal”  statements: 


n  :=  c;  finished:=FALSE; 

dnitially,  assign  one  subclass  to  each  class,  (a<  :=  ,  i  =  1,  •  •  • ,  c)  >; 

REPEAT 

<Construct  a  piecewise  linear  classifier  using  the  subclasses  alt  •  •  • ,  an  >; 
cCompute  an  evaluation  criterion  for  the  classifier>; 

IF  Sufficiently  good  performance>  THEN  finished:=TRUE 
ELSE  BEGIN 

Split  the  sample  space  of  one  of  the  subclasses  ai,  •  •  • ,  an  >; 

END; 

UNTIL  ((finished)  OR  (n  >  nmox)); 

<Generate  the  final  classified; 


The  algorithm  is  illustrated  in  figure  3.1.  Here  we  have  assumed  the  classes  to 
be  “banana”  distributed.  In  figure  3.1a,  a  linear  classifier  is  applied,  in 
figure  3.1b  the  sample  space  of  one  of  the  classes  is  split  into  two  subsample 
spaces  and  the  resulting  classifier  is  shown.  Finally  in  figure  3.1c  the  sample 
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space  of  both  classes  are  split  in  two  subsample  spaces  each  and  the  resulting 
classifier  is  shown. 


Figure  3.1:  Illustration  of  how  the  algorithm  works. 

Although  the  basic  idea  is  simple,  there  are  several  difficulties  which  have  to  be 
solved.  The  most  important  decisions  to  make  are: 


-  which  (sub)sample  space  to  split, 

-  how  to  split  a  given  (sub)sample  space, 

-  how  to  use  the  subclass  information  for  generating  a  piecewise  linear 
discriminant  function, 

-  how  to  terminate  the  splitting. 


The  algorithms  making  these  decisions  are  developed  in  the  next  chapter.  In  the 
following,  a  brief  description  of  the  main  ideas  is  given. 


3.1  Finding  a  suitable  (sub)sample  space  for  splitting 

The  first  problem  we  will  study  is  how  to  find  the  best  (sub)sample  space  to 
split.  This  is  very  important  because  we  want  to  make  the  classifier  as 
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adaptable  as  possible  to  the  classes  with  only  a  limited  number  of  linear 
discriminant  functions.  Two  different  splitting  strategies  have  been  developed. 

The  first  strategy  is  a  hillclimbing  approach.  This  algorithm  proceeds  through  a 
sequence  of  essentially  equal  steps.  Assume  that  there  are  n  subclasses.  Now, 
we  will  in  turn  pick  one  of  the  n  subclasses  and 


-  split  the  sample  space, 

-  generate  a  classifier  for  the  particular  set  of  subclasses, 

-  evaluate  the  classifier. 


The  subclass  (when  split)  which  result  in  a  classifier  having  the  best 
performance  is  chosen.  We  see  that  we  will  always  split  the  “best”  (sub)sample 
space.  However,  this  approach  requires  a  large  amount  of  computation. 

The  procedure  may  also  be  illustrated  with  the  following  “quasi-Pascal” 
statements,  where  n  is  the  number  of  subclasses  and  a,  denotes  the  :’th  subclass. 


FOR  i  :=  1  TO  n  DO 
BEGIN 

<Split  the  sample  space  of  a,  >; 

<Generate  a  piecewise  linear  classifier  using  ,  •  •  • ,  a,, ,  a,3 ,  ■  •  • ,  an  > ; 
<  Compute  the  error  rate  Pe  >; 

IF(Pe<JPemm)THEN 
BEGIN 
k  :=  i; 

P  •=  P  ■ 

END; 

< Merge  the  sample  spaces  of  a,,  and  a, 2  >; 

END; 

<SpIit  the  sample  space  of  a*  >; 
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As  seen,  this  approach  is  very  time  consuming.  Therefore,  an  other  approach 
has  been  considered  for  situations  where  the  design  time  is  important. 

It  is  reasonable  to  assume  that  the  classifier  need  to  be  more  sensitive  with 
respect  to  the  (sub)class  having  the  greatest  contribution  to  the  (estimated) 
error  rate.  A  performance  improvement  is  hopefully  achieved  by  splitting  the 
sample  space  of  this  (sub)class.  It  is  however  based  on  the  assumption  that  the 
error  rate  of  a  given  (sub)class  is  caused  by  either  multimodality  or  concavity 
and  not  by  overlap  with  other  (sub)classes.  The  procedure  is  illustrated  in  the 
following  statements: 


<Generate  a  piecewise  linear  classifier  using  the  (sub)classes  ai,  •  •  • ,  an  >; 
<Find  the  (sub)class  a*  contributing  most  to  Pe  > 

<SpIit  the  sample  space  of  (sub)class  a*  >; 


3.2  How  to  split  a  given  subsample  space 

The  purpose  of  the  splitting  module  is  to  divide  the  sample  space  of  a  given 
(sub)class  into  two  new  disjoint  sample  spaces  which  are  as  “much  separatedr 
as  possible.  This  process  becomes  a  clustering  procedure  because  no  a  priori 
information  concerning  the  distributions  and  the  sample  spaces  is  assumed.  The 
clustering  is  a  quite  difficult  task  because  in  a  particular  splitting  there  are  far 
more  possible  partitions  than  can  be  evaluated.  As  an  example,  it  is  easily 
shown  that  there  are  approximately  lO30  possible  allocations  of  100  samples  into 
two  clusters.  Thus  it  is  necessary  to  find  a  more  efficient  search  method  than 
the  exhaustive  search. 

To  achieve  our  goal  we  choose  the  method  of  limiting  the  search  space.  The 
algorithm  consists  of  two  parts. 

First  a  coarse  clustering  is  performed,  followed  by  an  optimization  procedure 
based  on  the  initial  clustering. 


Two  strategies  for  the  coarse  clustering  are  developed.  The  simplest  one  divides 


the  data  set  along  the  direction  of  maximum  variance.  This  is  done  by  finding 
the  principal  axis  of  the  data  set.  Hopefully  the  clusters  are  well  separated. 

The  other  strategy  is  somewhat  more  complicated.  First  a  pair  of  widely 
separated  samples  is  identified  and  used  as  starting  samples  (one  for  each 
cluster).  Then  each  sample  in  the  data  set  is  assigned  to  one  of  the  two  clusters 
according  to  a  given  rule. 

The  results  of  these  coarse  clusterings  are  obviously  not  optimal  with  respect  to 
any  criterion.  Therefore  an  optimization  algorithm  may  be  applied. 

This  optimization  algorithm  is  based  on  evolutionary  search,  which 

# 

unfortunately  is  a  suboptimal  procedure.  This  is  because  it  is  impossible  to  use 
an  optimal  branch  and  bound  algorithm  [23]  together  with  the  chosen  criterion 
function.  The  branch  and  bound  algorithm  does  also  require  a  great  amount  of 
computer  power  if  there  are  many  samples  in  the  data  set  to  be  split. 

The  evolutionary  search  uses  the  result  from  the  previous  step  as  an  initial 
clustering.  Next  each  sample  is  in  turn  considered  as  a  candidate  for 
reallocation.  If  reallocation  results  in  an  improved  value  of  the  criterion 
function,  the  sample  is  transferred.  Otherwise  it  remains  in  its  original  cluster. 
When  all  the  samples  in  the  data  set  have  been  investigated,  one  iteration  is 
finished.  The  algorithm  terminates  when  no  samples  are  transferred  during  an 
iteration. 
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3.3  Generation  of  piecewise  linear  classifiers 

Given  a  set  of  n  subclasses,  a  piecewise  linear  discriminant  function  has  to  be 
generated. 

The  simplest  way  of  generating  a  piecewise  linear  discriminant  function  is  to  use 
the  nearest  sub-mean  classifier.  This  classifier  assigns  a  sample  to  the  class  with 
the  nearest  sub-mean.  It  is  reasonable  to  assume  that  it  has  only  limited  ability 
to  be  adapted  to  the  data  set  satisfactorily.  Therefore  other  -  and  more 
sophisticated  -  methods  are  introduced. 

Two  of  them  are  based  on  the  mean  squared  error  approach.  In  this  approach 

9 

weight  vectors  are  produced  which  map  the  data  set  into  a  prespecified  set  of 
points  with  minimum  squared  error.  It  was  initially  assumed  that  classifiers 
based  on  the  mean  squared  error  approach  would  perform  well  because  these 
classifiers  are  minimum  mean  squared  error  approximations  to  the  Bayes 
classifier. 

We  will  also  introduce  a  new  approach.  The  aim  here  is  to  generate  a  piecewise 
linear  classifier  based  on  separating  certain  pairs  of  (sub)classes  by  a 
hyperplane.  First  the  algorithm  determines  the  pairs  of  (sub)classes  to  be 
separated.  Let  us  for  a  moment  assume  the  density  of  each  class  to  be  known. 
Then  two  (sub)classes  are  assumed  to  need  separation  if  the  optimal 
hypersurface  is  intersecting  the  line  going  between  the  means  of  the 
(sub)classes.  However,  as  long  as  these  densities  are  unknown,  these  pairs  are 
determined  by  using  estimation  techniques.  Next,  we  have  to  design  a  linear 
classifier  for  each  pair  of  (sub)classes.  Finally,  a  piecewise  linear  classifier  based 
on  the  linear  classifiers  is  computed.  This  is  done  by  using  the  generalized 
inverse,  which  is  computed  with  the  singular  value  decomposition  method. 
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3.4  Termination  of  the  algorithm  and  generation  of  the  final 
classifier 

The  last  operation  to  be  done  is  to  terminate  the  splitting  process  and  generate 
a  final  classifier. 

We  first  define  the  maximum  number  of  subclasses  of  the  classifier  (maximum 
number  of  discriminant  functions).  Afterwards  the  splitting  process  continues 
until 

a)  a  maximum  number  of  subclasses  is  reached, 

b)  no  subclass  can  be  split  further, 

c)  the  maximum  contribution  to  the  error  rate  (from  one  subclass)  is 
sufficiently  small. 


The  first  criterion  terminates  the  splitting  when  the  maximum  number  of 
discriminant  functions  of  the  classifier  is  reached,  and  the  second  one  is  used  if 
the  number  of  samples  in  any  subclass  is  too  low.  The  third  criterion  terminates 
the  splitting  process  if  it  is  reasonable  to  belive  that  little  reduction  in  the 
(design  set  based)  error  rate  estimate  is  gained  by  further  splitting. 

Among  a  set  of  classifiers  (based  on  different  number  of  subclasses),  we  are  to 
find  a  classifier  with  sufficiently  good  performance.  Two  ways  to  do  this  are 
suggested. 

The  first  one  is  simply  to  find  the  classifier  minimizing  a  given  criterion 
function.  The  criterion  function  proposed  in  this  work  combines  the  error  rate 
and  the  number  of  linear  classifiers. 


The  second  method  selects  the  classifier  having  the  smallest  number  of  weight 
vectors  among  those  which  are  sufficiently  close  to  the  classifier  with  the  lowest 
design  set  based  error  rate  estimate. 
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4  DEVELOPMENT  OF  THE  ALGORITHMS 

4.1  Determination  of  a  suitable  subsample  space  to  split 

In  chapter  3  we  stated  that  the  sample  space  of  each  class  (the  space  for  which 
p(x|u>,)  >  0)  was  partitioned  into  disjoint  subsample  spaces  in  order  to  make  the 
classifier  more  adaptable  to  the  classes.  This  process  is  iterative.  One 
(sub)sample  space  is  being  split  during  one  iteration,  and  it  is  never  later 
merged  together  with  other  (sub)sample  spaces.  Therefore  it  is  of  considerable 
importance  to  determine  the  right  one. 

In  the  previous  chapter,  we  suggested  two  different  strategies  for  selecting  a 
(sub)sample  space  for  splitting.  The  first  one  was  a  hillclimbing  approach, 
which  found  the  “best”  (sub)sample  space  to  split.  The  other  one  split  the 
sample  space  of  the  (sub)class  contributing  most  to  the  error  rate.  These 
strategies  will  be  described  in  section  4.1.1  and  4.1.2  respectively.  However,  we 
will  start  with  defining  the  (sub)sample  spaces  and  deriving  simple  estimators 
for  the  error  rate  of  a  (sub)class. 

At  some  point  in  the  process,  let  us  assume  that  we  have  split  the  sample  spaces 
of  the  c  classes  into  n  subsample  spaces  and  that  the  class  u consists  of  n, 
disjoint  subsample  spaces.  In  other  words 


n  =  n<  • 

t=i 

Furthermore,  let  5,-  denote  the  sample  space  of  u;,  and  let  S,,  denote  the  j’th 
subsample  space  of  u>i.  The  subsample  spaces  of  each  class  are  disjoint  which 
implies 


Si,  n  Sik  =  0,  j  ^  k 


and 
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Si  =  5,j  U  5j2  U  •  ■  •  U  Si 


We  will  in  this  section  assume  the  n  subsample  spaces  to  be  known  a  priori. 
Finally,  let  /ft  denote  the  statement  “The  sample  x  € 

The  error  rate  of  a  classifier,  also  called  the  probability  of  misclassifying 
samples,  is  given  as 


P(e)  =  £PWP(+«)  (4.1) 

1=1 

where  P(e)  is  the  overall  probability  of  misclassifying  samples  (the  error  rate), 
P(a>j)  is  the  a  priori  probability  of  u;,-  and  P(e|uj;)  is  the  conditional  probability 
of  misclassifying  samples  from  u>,-.  However,  when  computing  the  error  rate,  we 
need  to  know  the  conditional  probability  density  p(x|u;;)  V  i.  Usually  these 
densities  are  not  available.  Therefore  we  have  to  use  estimation  techniques  for 
these  tasks.  The  estimation  is  based  on  a  data  set  consisting  of  samples 
(realisations)  drawn  independently  according  to  the  class  densities 
p(x|a\),  i  =  1,  •••,<:. 

Now,  assume  that  a  classifier  is  available.  Using 

P(eM  =  J-,,  (4-2) 

we  obtain 

£(«)-EP(<*)4  (4-3) 

1=1  iV« 

where  /,  is  the  number  of  misclassifications  and  Ni  denotes  the  number  of 
samples  representing  u 

If  desirable,  we  may  also  use  the  subclass  information  for  obtaining  the  error 

Nt  J% 

rate.  This  is  easily  seen  by  using  and  ■£-  as  estimators  for  P (/?,-;  |u;i)  and 
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P(e  |  respectively.  N{}  and  fij  denote  respectively  the  number  of  samples 

and  the  number  of  misclassifications  from  the  samples  in  the  j’th  subclass  of 

As  a  special  case,  we  see  that  P(e)  =  -fe  if  P(u;;)  =  $  where  /  and  N  denote  the 
total  number  of  misclassifications  and  the  total  number  of  samples  in  the  data 
set. 

We  will  also  in  the  following  use  a  performance  measure  Pp,  defined  as 
Pp  =  1  —  P„,  where  Pe  denotes  the  design  set  based  error  rate  estimate.  This 
performance  measure  is  useful  for  measuring  how  well  a  classifier  is  adapted  to 
the  data  set. 

Now,  when  the  error  rates  and  their  estimates  have  been  stated,  we  will  study 
how  this  information  may  be  used  for  determining  which  subsample  space  to 
split  in  the  next  iteration. 

4.1.1  The  hillclimbing  approach 

This  algorithm  proceeds  through  a  sequence  of  essentially  equal  steps.  Assume 
that  n  subsample  spaces  are  available  at  a  certain  step.  Then,  pick  one  of  the  n 
subsample  spaces  in  turn,  split,  generate  a  classifier  and  evaluate  the  result. 

The  split,  resulting  in  a  classifier  having  the  best  performance  is  chosen,  and  the 
next  step  involving  n  +  1  subclasses  is  initiated. 

As  in  chapter  3  we  define: 


ax  =  x  belongs  to  u>x  and  x  6  5i, 

ani  =  x  belongs  to  u>i  and  x  6  SXni  (4.4) 

an,+1  =  x  belongs  to  u>2  and  x  €  S2i 

an  =  x  belongs  to  u>c  and  x  €  SCnc 
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The  approach  may  be  illustrated  iu  the  following  statements: 


Ppm«:=0.0; 

FOR  t:=l  TO  n  DO 
BEGIN 

<  Split  the  sample  space  of  c*j  >; 

<Generate  a  PLC  using  the  subclasses  Qi,  •  •  • ,  a,, ,  •  •  • ,  an  >', 

<Compute  the  performance  Pp  >; 

IF  (P„  >  PPm<J  THEN 
BEGIN 
k:=i] 

Ppma*  :==  Fpi 
END; 

<  Merge  the  sample  spaces  of  a,,  and  q,2  >; 

END; 

<Split  the  sample  space  of  ak  >; 


The  performance  is  computed  according  to  eq  4.3.  This  algorithm  is  quite  time 
consuming  because  in  a  given  step  each  subsample  space  has  to  be  split,  a 
classifier  generated  and  the  result  evaluated. 

Even  though  the  computation  time  required  for  the  design  phase  is  not 
important  in  most  cases,  it  is  of  interest  to  investigate  whether  or  not  it  is 
possible  to  reduce  the  design  time  without  reducing  reliability.  Therefore 
another  algorithm  is  described  in  the  following. 

4.1.2  A  contribution  based  splitting 

This  algorithm  splits  and  evaluates  only  one  (sub)sample  space  in  each 
iteration.  We  must  therefore  try  to  select  the  split  which  is  most  likely  to 
improve  the  classification  result.  In  order  to  obtain  this,  it  is  reasonable  to  split 
the  sample  space  of  the  (sub)class  with  the  largest  contribution  to  the  design 
set  based  error  rate  Pe.  This  is  based  on  the  assumption  that  contribution  to  Pe 
from  a  subclasc  is  caused  by  concavity  and/or  multimodality  of  the  (sub)class 
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and  not  by  overlap  with  another  subclass.  The  procedure  may  be  illustrated 
with  the  following  statements: 


<Generate  a  PLC  using  the  subclasses  •  •  • ,  a„  >; 
<Find  the  subclass  a*  contributing  most  to  Pe  > 
<Split  the  sample  space  of  a*  >; 


We  will  now  search  for  the  subclass  with  the  largest  contribution  to  Pe.  Using 
previously  defined  notation,  this  is  the  subclass  Qjt  for  which 

P(e,at)  =  max|p(^(j))^^J  (4.5) 

where  a,  is  defined  according  to  4.4.  Moreover,  g(j)  finds  the  class  which  the 
subclass  aj  belongs  to,  and  h(j)  returns  the  subclass  number  of  aj  within  u3(j). 

In  the  two  previous  sections,  we  have  developed  two  different  splitting 
strategies.  For  being  able  to  evaluate  these  different  algorithms,  several 
simulation  experiments  have  been  carried  out. 

4.1.3  Comparison  of  the  two  strategies 

The  only  way  of  compaxing  these  two  strategies  is  through  Monte  Carlo 
simulations.  In  this  context  Monte  Carlo  simulations  consists  of  first  generating 
data  sets  for  the  different  classes.  Then  for  each  subclass  selection  strategy, 
splitting  algorithm  and  classifier  algorithm,  the  performance  is  computed.  This 
process  is  called  a  replication,  and  it  is  repeated  a  number  of  times  in  order  to 
obtain  reliable  statistics  on  the  performance.  The  mean  of  the  performance  and 
the  corresponding  standard  deviation  are  used  as  statistics  in  this  evaluation. 
The  standard  deviation  of  the  performance  may  be  viewed  as  a  stability 
criterion.  If  the  standard  deviation  is  high,  it  may  be  assumed  that  the  splitting 
performance  depends  very  much  on  actual  realizations,  and  hence  the  algorithm 
may  be  unreliable. 


The  disadvantage  of  using  Monte  Carlo  simulations  is  that  we  cannot  draw 
general  conclusions.  However,  by  using  many  different  probability  distributions 
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and  number  of  samples  in  the  data  sets,  the  evaluation  should  fairly  well 
represent  the  algorithms  capability.  In  the  simulations  we  used  classes 
containing  Gaussian  distributed,  double  exponentially  distributed,  “banana” 
distributed  (to  be  defined  below)  and  “bimodai  Gaussian”  distributed  samples. 
Moreover,  in  the  simulations  using  Gaussian  or  double  exponential  distributed 
samples,  different  relative  positions  of  the  distributions  were  evaluated,  see 
figure  4.1.  Simulations  were  made  for  7  =  k  x.  10°,  k  =  0, 1,  •  •  • , 9.  For  each 


Figure  4.1:  Distribution  of  Gaussian  and  double  exponential  samples, 
angle  7,  4  different  distances  r7  were  used.  In  all  simulations  we  have  used 


a=(Ai  °)  =  (9  °) 

V  °  -W  \  1  1  / 

In  order  to  have  a  “rotation  invariant”  distance  r7,  we  defined 


r~,  -  d(cr-y  +  \i) 
where 


<ry  =  [cos(7),sin(7)] 


Ax  0 
0  A2 


cos(7) 

sin(7) 
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This  means  that  <r7  denotes  the  standard  deviation  of  the  samples  from  cjj 
projected  into  the  vector  [cos(7),sin(7)]‘.  When  and  7  are  known,  p2  can  be 
determined  as 


/  A‘i(l)  +  r^cos(7)  \ 
\  Pi  (2)  +r7sin(7)  / 


By  using  this  way  of  defining  the  distance  between  /ij  and  /*2,  the  overlap  of 
samples  between  uj\  and  u;2  is,  for  a  given  d,  almost  independent  of  the  angle  7. 
Simulations  are  made  using  d  =  0.5, 1.0, 1.5  and  2.0. 

Simulations  have  also  been  carried  out  with  samples  from  07  taken  from  a 
“banana”  distribution,  and  the  samples  from  u?  taken  from  a  Gaussian 
distribution.  This  is  shown  in  figure  4.2.  In  these  simulations,  the  u;2  samples 


ro - -H-s-H 


Figure  4.2:  Distribution  of  banana  and  Gaussian  distribution. 

are  taken  from  a  N( 0, 3  I)  distribution.  The  banana  distribution  is  defined 
according  to  figure  4.3.  In  polar  coordinates,  using  /x  as  origo,  the  banana 
distribution  has  a  density 


p(r,9)  =  p(9)p(r\9 ) 
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Figure  4.3:  The  banana  distribution, 
where 


?(*)  = 


-£l*-0l  +  £  >  l«-e|<f 

0  ,  otherwise 


p(r|0)  =  1V(5,<t(0))  (r0  =  5) 


and 


cr(d)  =  1  -  — 1<9  —  ©|  . 

7T 

Due  to  symmetry,  the  results  are  independent  of  0.  Simulations  are  made  for  6 
different  values  of  s  (s  =  k  =  1,  •  •  • ,  6)  to  investigate  various  degrees  of 
overlap. 

Finally,  a  3  class  problem  was  simulated  in  order  to  demonstrate  the  multi-class 
properties  of  the  classifiers,  as  well  as  to  see  how  the  methods  handle 
multimodale  distributions.  The  distributions  were  chosen  as  follows: 
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p(xM  - N  ((!)'s(  o  #)) 

p(x|w3)  =  N  (  (  3  )  ’ s  I ) 

Simulations  using  s  =  0.5,  1.0,  1.5  and  2.0  have  been  made. 

All  classifiers  to  be  described  in  section  4.3  were  used  in  the  simulations. 
Moreover,  there  were  25  samples  in  the  data  set  of  each  class.  In  some  of  the 
simulations  we  have  also  used  150  samples  in  the  data  set  of  each  class 
(Gaussian  and  double  exponential  distributions  using  7  =  30°  and  7  =  60°,  the 
banana/Gaussian  distributions  and  the  three  class  problem).  The  reason  for 
mostly  using  small  data  sets  is  that  the  computation  time  increases  considerably 
as  the  number  of  samples  increases.  Equal  a  priori  probabilities  have  also  been 
assumed  everywhere  in  these  simulations,  and  only  two-dimensional  samples 
have  been  used.  These  restrictions  are  caused  by  the  limitations  imposed  by  the 
high  computationally  costs  of  the  simulations.  Therefore,  we  prefer  to  study 
situations  in  reasonable  detail  rather  than  doing  exploratory  analysis  of  a  larger 
but  less  well  defined  classes  of  situations.  Moreover,  we  have  no  indications  that 
the  algorithms  should  behave  essentially  different  in  a  higher  dimensional  space, 
or  in  situations  with  unequal  a  priori  probabilities. 


It  is  unfortunately  impossible  to  present  all  results  from  the  simulations. 
Therefore  we  will  only  illustrate  the  main  results  using  examples  and  give  some 
general  remarks.  However,  all  results  are  available  on  request. 
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Before  discussing  our  observations,  we  present  some  examples  of  the  simulation 
results.  Figure  4.4  shows  the  mean  performance  of  simulations  using  double 
exponential  distributions  and  the  classifier  described  in  4.3.4.  In  figure  4.5  we 
have  plotted  the  results  from  the  simulations  where  the  classes  are  banana  and 
Gaussian  distributed.  Furthermore,  the  results  are  also  tabulated  in 
tables  4. 1-4.4.  Both  mean  and  standard  deviation  of  Pe  are  given. 


d  «  0.5 


taken  from  double  exponential  distributions,  7  =  30°.  Plots  are  made 
for  both  splitting  strategies  with  <f=0.5,  1.0  and  1.5. 


As  expected  we  found  the  hillclimbing  approach  to  be  clearly  better  than  the 
contribution  based  splitting.  At  least  10%  -  20%  better  performance  is  achieved 
by  using  the  hillclimbing  approach.  When  using  the  contribution  based 
splitting,  the  standard  deviation  of  the  performance  increases  rapidly  as  the 
number  of  subclasses  increases.  These  observations  should  indicate  that  this 
procedure  is  somewhat  unstable  and  hence  unreliable.  However,  this  simplified 
approach  may  perform  satisfactorily  if  the  main  reason  for  errors  is  caused  by 
multiple  mode  and/or  concavities.  This  fact  is  seen  from  the  simulations  using 
classes  with  Gaussian  and  banana  distributed  samples  with  small  values  of  s. 
However,  when  there  are  moderate  or  more  overlap  between  the  classes,  the 
contribution  based  splitting  performs  poorly.  In  both  approaches,  the 
performance  do  also  tend  to  be  poor  and  unreliable  when  the  number  of 
samples  in  the  subclasses  are  small.  This  is  for  instance  seen  in  the  simulations 
using  25  samples  in  each  class  and  having  6-8  subclasses.  In  these  simulations. 
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Mo.  aubel.  Mo.  •• 


Figure  4.5:  Design  set  based  error  rate  estimates  using  25  (a)  and  150  (b)  samples 
taken  from  banana  and  Gaussian  distributions.  Plots  are  made  for 
both  splitting  strategies  with  s=1.0,  2.0  and  3.0. 

the  standard  deviation  becomes  quite  high. 

To  conclude  this  discussion,  we  found  the  hillclimbing  approach  to  be  clearly 
superior  to  the  contribution  based  splitting.  Thus,  it  should  be  used  in  all 
situations  for  which  the  design  time  is  of  minor  importance. 
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Spl. 

H 

No.  of  subclasses 

II 

2 

3 

4 

5 

6 

7 

8 

4.1.2 

0.5 

0.178 

Bill 

r 

0.124 

mmi 

0.049 

4.1.2 

1.0 

IMS 

0.060 

0.052 

jam 

0.062 

mm 

0.031 

0.028 

BUI 

0.052 

4.1.2 

1.5 

0.021 

0.018 

0.018 

H 

0.022 

0.029 

0.021 

0.018 

0.020 

B 

0.031 

0.044 

4.1.2 

2.0 

0.007 

0.006 

mm 

0.007 

QS 

0.014 

0.014 

EES 

0.018 
— - - - 

4.1.3 

0.5 

0.164 

0.152 

mm 

0.140 

am 

0.104 

mil 

0.047 

0.049 

WM 

0.122 

B 

0.153 

4.1.3 

1.0 

0.065 

mm 

0.085 

mm 

0.068 

0.059 

0.033 

0.113 

B 

0.149 

0.149 

4.1.3 

1.5 

mm 

0.043 

0.034 

0.022 

\mm 

0.099 

0.102 

0.095 

4.1.3 

2.0 

0.006 

0.009 

0.018 

0.016 

ma 

0.012 

0.010 

0.011 

0.024 

0.066 

0.069 

B 

0.067 

0.066 

Table  4.1:  Results  using  25  samples  taken  from  double  exponential  distributions, 
7  =  30°.  Both  mean  and  standard  deviation  of  the  design  set  based 
error  rate  estimate  are  tabulated  for  each  method/distance. 
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Spl. 

El 

No.  of  subclasses 

1 

2 

3 

4 

5 

6 

7 

8 

4.1.2 

0.5 

psi| 

B 

0.178 

0.177 

EBi 

HI 

0.025 

0.026 

4.1.2 

1.0 

0.087 

0.080 

0.072 

0.071 

0.071 

0.015 

0.016 

lifU 

0.015 

0.016 

0.016 

4.1.2 

1.5 

Eg 

0.027 

B59 

0.027 

0.027 

EH 

B 

0.009 

0.009 

0.009 

4.1.2 

2.0 

0.011 

0.010 

0.010 

0.010 

0.010 

EH 

0.007 

DH 

0.007 

0.007 

0.007 

0.007 

4.1.3 

0.5 

B 

B 

B 

Eg 

gi 

Eg 

0.205 

B 

B 

B 

B 

EH 

IHPEE1 

0.046 

4.1.3 

1.0 

0.087 

0.087 

B 

lllll 

0.086 

0.105 

0.113 

0.015 

0.018 

B 

HUM 

0.023 

0.081 

0.116 

4.1.3 

1.5 

B 

|PB 

0.033 

0.044 

EEI 

B 

0.051 

0.098 

4.1.3 

2.0 

0.012 

0.013 

B 

0.019 

0.022 

0.033 

0.035 

0.006 

0.007 

EH 

0.024 

0.054 

0.090 

0.103 

Table  4.2:  Results  using  150  samples  taken  from  double  exponential  distributions, 
7  =  30°.  Both  mean  and  standard  deviation  of  the  design  set  based 
error  rate  estimate  are  tabulated  for  each  method/distance. 


Table  4.3:  Results  using  25  samples  taken  from  Gaussian  and  banana  distribu¬ 
tions.  Both  mean  and  standard  deviation  of  the  design  set  based  error 
rate  estimate  are  tabulated  for  each  method/distance. 


1.0  0.078 
0.014 


36  0.000 
09  0.000 


2.0  0.147  0.067 
0.019  0.013 


00  0.000  0.000  0.000  0.000 

00  0.000  0.000  0.000  0.000 


2 

2 


0.055 

0.014 

0.055 

0.014 

0.097 

0.016 

0.097 

0.016 

4.1.3 

0.5 

4.1.3 

1.0 

4.1.3 

1.5 

4.1.3 

2.0 

4.1.3 

2.5 

4.1.3 

3.0 

0.014 


2.0  0.146  0.110 
0.019  0.051 


0.172 

0.020 


0.116 

0.078 


88  0.178  0.174  0.169  0.167 
58  0.053  0.053  0.061  0.062 


.207 

0.198 

0.190 

.034 

0.041 

0.039 

Table  4. 


4:  Results  using  150  samples  taken  from  Gaussian  and  banana  distribu¬ 
tions.  Both  mean  and  standard  deviation  of  the  design  set  based  error 
rate  estimate  are  tabulated  for  each  method/distance. 
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4.2  How  to  split  a  given  subsample  space 

The  purpose  of  the  splitting  process  is  to  make  the  classifier  more  adaptive  to 
the  classes.  If  a  (sub)class  to  split  is  known,  the  splitting  algorithm  attempts  to 
divide  the  given  (sub)sample  space  into  two  disjoint  parts  which  is  as  much 
separated  as  possible.  Thus,  our  problem  is  to  divide  the  sample  space  S  into 
two  sample  spaces  S i  and  S2  (5  =  Si  U  S2,  Si  H  S2  =  0)  such  that  a  given 
criterion  function  is  optimized.  As  an  example,  we  found  that  maximizing  the 
distance  from  the  mean  of  the  original  (sub)class  to  the  mean  of  the  nearest 
new  subclass  gives  good  results.  Intuitively,  this  is  also  a  reasonable  criterion  to 
use.  Due  to  unknown  probability  densities,  we  have  to  estimate  the  criterion 
function  using  the  data  set  X ,  which  consists  of  N  samples  representing  the 
(sub)class  to  split.  Assume  the  sample  spaces  S[  and  S2  are  to  be  evaluated, 
and  moreover  let  X{  =  {xi,,  •  •  • , }  denote  the  data  set  of  samples  for  which 
xtj  £  S[.  Furthermore,  X'2  =  {x2l ,  •  •  • ,  x2^  }  denotes  the  data  set  of  samples  for 
which  x2j  £  S'2.  Ni  denotes  the  number  of  samples  in  Xi,  i  =  1,2.  Then  an 
estimate  of  the  above  criterion  is 


•/(*;,*')  =  m|nj 


and  the  best  to  do  -  and  hence  our  aim  -  is  to  determine  the  data  sets  X\  and 
X2  for  which  J  is  optimized.  Because  Si  and  S2  cannot  be  uniquely  determined, 
we  are  choosing  one  of  the  possible  solutions,  for  example  the  one  for  which  the 
surface  separating  Si  and  S2  is  the  nearest  neighbour  hypersurface  separating 
X\  and  X2.  This  surface  may  easily  be  found  when  X\  and  X2  have  been 
determined. 

As  we  have  seen,  the  splitting  is  actually  a  clustering  problem.  This  problem  is 
quite  difficult  to  solve  due  to  the  astronomical  number  of  possible  partitions 
which  have  to  be  evaluated.  Thus  we  are  left  with  two  choices:  Either  finding  a 
much  more  efficient  search  method  than  the  exhaustive  search,  or  to  limit  the 
search  to  only  parts  of  the  space  of  partitions.  Optimal  splitting  according  to  a 
given  criterion  is  possible  for  a  very  limited  set  of  criterion  functions  only  [23], 
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and  unfortunately  it  cannot  be  used  with  the  criterion  function  we  found  to  be 
the  best,  therefore  we  have  to  use  a  suboptimal  algorithm  based  on  the 
principle  of  evolutionary  search.  Nevertheless  it  has  been  found  to  perform  well. 
Our  algorithm  consists  of  two  parts.  First  a  coarse  clustering  is  performed  for 
creating  an  initial  splitting.  Secondly,  an  optimization  procedure  based  on  the 
initial  splitting  is  performed.  In  4.2.1  the  initial  splitting  is  studied,  and  in  4.2.2 
the  optimization  procedure  is  considered. 

4.2.1  Initial  splitting  procedures 

There  are  two  ways  of  performing  a  hierarchical  clustering  of  data  sets,  either 
agglomerative  or  divisive  methods  [15].  In  agglomerative  methods  we  are 
starting  with  n  clusters,  one  sample  in  each  cluster.  Then  the  two  nearest 
(according  to  a  given  criterion  function)  clusters  are  merged,  and  there  are  now 
n  —  1  clusters  left.  This  process  continues  until  the  samples  are  clustered  in  a 
predefined  number  of  clusters.  Agglomerative  methods  may  perform  well  if  the 
samples  actually  fall  into  clusters  which  are  fairly  well  separated.  Otherwise, 
most  of  the  samples  usually  tend  to  be  included  in  one  cluster  only,  and  hence  a 
very  poor  splitting  will  result.  For  our  problems,  the  above  requirement  is 
generally  not  fulfilled  since  samples  may  have  been  taken  from  a  unimodale 
distribution.  Therefore  agglomerative  clustering  algorithms  are  not  appropriate. 

In  a  divisive  hierarchical  clustering  algorithm,  a  class  represented  by  a  set  of 
samples  is  split  into  two  clusters.  Divisive  algorithms  are  useful  for  splitting  in 
situations  where  all  samples  are  taken  from  a  unimodale  probability 
distribution.  We  have  studied  two  different  strategies.  These  will  be  described 
in  the  following  paragraphs.  In  4. 2. 1.3  problems  concerning  outliers  axe  treated. 

4. 2. 1.1  Splitting  based  on  the  scatter  matrix 

The  first  strategy  described  is  quite  simple  to  compute.  It  is  based  on  dividing 
the  data  set  along  the  direction  of  maximum  variance.  This  is  a  reasonable 
strategy  to  use  for  several  reasons.  First  of  all  the  method  results  in  new  clusters 
with  reduced  variance  in  the  direction  of  the  maximum  variance  of  the  parent 
cluster.  Therefore  we  may  assume  that  the  trace  of  the  (sub)classes  will  be 
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substantially  reduced.  Moreover,  if  we  assume  that  the  samples  in  the  data  set 
are  taken  from  a  Gaussian  distribution,  then  we  are  maximizing  the  projected 
distr  nee  from  the  mean  of  the  data  set  to  the  mean  of  the  nearest  (new)  cluster. 

This  direction  is  given  by  the  vector  a,  for  which 

Y,  [a‘(xfc  -  /i)]2  =  max  ( Y  [V(x*  “  /*)]2}  .  Iiall  =  INI  =  1  (4.6) 

k=i  >  U=i  J 

where  N  is  the  number  of  samples  in  the  given  data  set,  /i  its  mean,  and  7  is  an 
arbitrary  vector  of  unit  length.  The  vector  a  is  easily  obtained  by  using  the 
scatter  matrix  W  of  the  data  set  which  is  defined  as 

W  =  £  (x<  "  /*)  (x<  ~  (4-7) 

i=  1 

The  vector  a  is  parallel  to  the  eigenvector  corresponding  to  the  greatest 
eigenvalue  of  W.  When  the  direction  vector  is  found,  a  sample  x*  is  assigned  to 
cluster  1  if  a‘(xfc  —  jx)  >  0.  Otherwise  it  is  assigned  to  cluster  2. 

Thus,  using  this  algorithm  the  data  set  is  split  by  a  hyperplane  perpendicular  to 
a  which  is  passing  through  the  mean  of  the  data  set. 

4.2.1. 2  Splitting  based  on  starting  samples 

This  algorithm  starts  with  finding  two  samples,  one  “representative”  for  each 
(new)  clusters.  Each  unclustered  sample  is  then  assigned  to  one  of  the  clusters, 
i.e.  to  the  cluster  with  the  nearest  mean,  the  nearest  sample  etc.  The  algorithm 
may  be  illustrated  with  the  following  statements: 
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<Find  the  starting  samples  Xj,  and  X2,  for  X\  and  X 2  respectively >; 

FOR  i  :=  1  TO  (N- 2)  DO 

BEGIN 

IF  (/(x<,A\)  <  f(xi,X2))  THEN  fc:=l  ELSE  k:=2; 

Nk:=Nk  +  1; 

Xk'.~Xk  U  x,-;  (*  Add  the  sample  to  the  data  set  of  cluster  k.  *) 
END; 


Here,  Nk,  k  —  1,2  is  the  number  of  samples  in  cluster  k,  and 
Xk  =  {x*, ,  ■  •  •  ,Xkftk }  is  the  data  set  of  cluster  k.  X  =  {xl?  •  •  •  ,xN)  denotes  the 
data  set  being  split,  and  /  is  the  criterion  function.  We  see  that  the  algorithm 
consists  of  two  parts.  First  the  starting  samples  have  to  be  determined,  and  then 
the  rest  of  the  samples  are  to  be  assigned  to  one  of  the  two  clusters.  We  will 
start  with  the  problems  concerning  the  determination  of  the  starting  samples. 

4.2. 1.2. 1  Determination  of  the  starting  samples 

The  performance  of  the  splitting  algorithms  depends  very  much  on  properly 
chosen  starting  samples.  We  are  interested  in  a  splitting  resulting  in  a 
maximum  distance  from  the  mean  of  the  data  set  to  the  mean  of  the  nearest 
cluster.  Therefore  it  is  reasonable  to  define  two  widely  separated  samples  as 
starting  samples  and  then  let  the  (new)  clusters  “grow”  towards  each  other. 

Two  possible  ways  of  determining  these  samples  are  studied.  The  first  and 
simplest  way  of  choosing  the  starting  samples  is  to  use  the  two  most  distant 
samples.  Let  Xj,  and  x2l  denote  the  starting  samples.  Then 

llxii  ~x2,ll  =nwx{||Xi  -Xj||}  ,  i,j  =  l,2,--*,iV.  (4.8) 

However  this  method  is  not  robust.  If  for  instance  one  of  the  starting  samples  is 
an  outlier  (to  be  defined  in  4. 2. 1.3)  a  bad  initial  splitting  (clustering)  may 
occur.  Therefore  a  more  robust  strategy  is  proposed.  This  is  to  choose  the  two 
most  distant  vectors  projected  into  a  vector  a.  a  is  parallel  to  the  direction  of 
maximum  variance  and  is  computed  according  to  4.2. 1.1.  In  other  words  we 
choose  the  starting  samples  such  that 
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||a*(xu  —  x2l ) ||  =max{||a‘(xt  -xi)||}  ,  i, j  =  1,2, •  •  •  ,iV.  (4.9) 

This  approach  is  more  robust  with  respect  to  outliers  because  a  few  outliers 
have  moderate  influence  on  the  vector  a. 

4.2. 1.2.2  Splitting  of  a  subclass  when  the  starting  samples  are  given 

Let  us  assume  that  the  starting  samples  have  been  found.  Now,  the  problem  is 
to  assign  the  rest  of  the  samples  to  one  of  the  two  clusters.  Let  us  assume  that 
Ni  samples  have  been  assigned  to  cluster  1  and  JV2  to  cluster  2  (N\  +  N2  <  N). 
Furthermore,  X\  and  X2  denote  the  data  set  of  cluster  1  and  cluster  2 
respectively.  As  stated  earlier,  a  non-assigned  sample  is  assigned  to  the  cluster 
producing  the  smallest  value  of  a  given  criterion  function.  In  this  work  six 
different  criterion  functions  have  been  studied.  With  exception  of  the  first  one, 
they  are  all  described  by  Hand  [15]  in  connection  with  agglomerative  clustering. 
The  criterion  functions  are  as  follows: 

i)  The  distance  to  the  starting  sample. 


}\  (x,**)  =  ||x-xfcJ 


ii)  The  distance  to  the  nearest  sample. 


M*,**)  =  nfin  { |]x  —  Xjt,||}  ,  i  =  l,---Nk 


iii)  The  distance  to  the  furthest  sample. 
/3(x,«Tfe)  =  rnax{||x  -  xfc,||}  ,  i  =  !,-••,  Nk 
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iv)  The  distance  to  the  mean. 

f  4  (x,  Xk )  = 

v)  The  distance  to  the  median. 
h  (x,  Xk)  =  ||x  —  Med(A'fe)|| 

where  the  median  of  the  data  set  is  defined  as 

'  Med(xfcl(l),---,xfcWfc(l))  N 
Med(Tffc)  =  i 

^  Med (xkl(d),---,xkNi(d))  j 

Med(xjt1(i),  •  •  •  ,Xfcw  (i))  denotes  the  median  of  the  i’th  element  of  the  vectors  in 

A*. 

vi)  The  group  average  distance. 

h{*.,Xk)  =  llx“  xfc.ll 

The  first  criterion  is  equivalent  to  separate  the  data  set  using  a  hyperplane 
given  by  the  equation 

z<(xi,  -  x2l)  -  ^ (xij  -  X2l)‘(xh  -  x2l)  =  0  . 

Hence  it  should  be  very  clear  that  its  performance  depends  very  much  on 
properly  chosen  starting  samples.  When  using  this  criterion  it  is  of  great 
importance  to  be  able  to  handle  problems  caused  by  outliers. 


The  properties  of  the  other  criterion?  are  discussed  by  Hand  [15],  and  will  not 
be  recapitulated  here. 
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4.2. 1.3  Outlier  handling 

Outliers  may  be  considered  as  “wild”  samples,  -  samples  which  are  far  away 
from  the  rest  of  the  samples  representing  the  same  class.  A  somewhat  diffuse 
definition  may  be  as  fellows:  Assume  a  class  has  samples  taken  from  a 
distribution  with  the  density  p(r),  and  the  sample  Zq  is  to  be  tested.  If  p(z)  is 
very  small  in  a  neighbourhood  around  z0,  then  z0  is  considered  to  be  an  outlier. 
We  stated  earlier  that  use  of  eq.  4.8  should  be  avoided  because  it  is  very 
sensitive  with  respect  to  outliers.  Moreover,  outliers  do  also  have  influence  on 
all  criterion  functions  mentioned  in  the  last  paragraph.  By  detecting  the 
outliers,  we  may  reduce  their  influence  considerably.  This  may  be  achieved  by 
using  the  following  steps: 


a)  Detect  till  outliers  in  X  =  {xj,  •  •  •  ,  x#}. 

b)  Determine  the  starting  samples  by  using  all  samples  in  X  except  the 
outliers. 

c)  Assign  all  non-outliers  to  one  of  the  two  clusters. 

d)  Assign  all  outliers. 


A  Mahalanobis  distance  based  outlier  detector  will  most  likely  work  well  in 
many  practical  situations.  However,  in  appendix  A  an  alternative  and  more 
robust  outlier  detector  is  presented.  This  detector  is  used  in  the  following,  but 
using  a  Mahalanobis  distance  based  detector  would  probably  in  most  cases  give 
similar  results. 

4.2.2  Optimization  of  the  initial  splitting 

The  result  of  the  initial,  coarse  splitting  is  obviously  not  optimized  with  respect 
to  any  criterion.  Therefore  an  optimization  algorithm  should  be  applied. 

Various  optimization  criterions  may  be  found  in  Hand  [15].  They  are  all  based 
on  the  scatter  matrices.  The  within-class  scatter  matrix  is  defined  as 
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2  Nk 


w  =  £  £  (Xk.  -  Mfc)  (xfc,  -  /**)' 

fc=i t=i 


and  the  between-class  scatter  matrix  is  defined  as 

B  =  £  (/**  ~  A*)  (Mfc  “ 

Jk=l 


One  popular  criterion  is  the  trace  of  the  within-class  scatter  matrix,  tr(W), 
which  is  to  be  minimized.  This  is  a  reasonable  criterion  for  several  reasons. 

First  of  all,  tr(W)  is  the  sum  of  variances  (and  the  sum  of  eigenvalues)  of  W. 
Minimizing  this  trace  is  indicating  that  a  good  split  is  found  because  the  sum  of 
variances  is  “small”,  and  thus  the  subclasses  are  assumed  “compact”.  Secondly, 
as  shown  in  appendix  B,  minimizing  tr(W)  is  the  same  as  maximizing 
|(/*i  —  /x)‘(/x2  —  A*)l-  This  inner  product  may  be  viewed  as  the  distance  between 
and  /x  when  projected  into  the  vector  (/x2  —  /x)  which  is  also  a  reasonable 
criterion  to  optimize.  However,  the  tr(W)  is  little  robust.  If  the  data  set 
contains  outliers  or  some  of  its  samples  are  “sparsely”  distributed,  then  the 
majority  of  the  samples  may  end  up  in  one  of  the  clusters  only.  Furthermore, 
scaling  will  have  influence  on  the  clustering. 

Another  popular  criterion  is  the  determinant  of  W.  By  minimizing  |W|  we  are 
in  fact  minimizing  the  product  of  the  eigenvalues.  Contrary  to  the  tr(W), 
scaling  have  no  influence  on  the  |W|.  Except  for  this  feature,  the  disadvantages 
are  the  same  as  for  tr(W).  For  a  more  detailed  discussion,  see  Hand  [15j. 

As  an  alternative  to  the  scatter  matrices,  first  order  statistics  may  be  used.  A 
criterion  performing  well  is  the  distance  between  the  mean  of  the  original  data 
set  and  the  nearest  (new)  cluster.  This  criterion,  min{||/x,  —  /x||},  is  to  be 
maximized.  Its  main  advantage  that  it  is  robust  to  outliers.  However,  scaling 
will  have  influence  on  the  criterion. 

For  some  criterion  functions  (and  small  data  sets)  it  is  possible  to  split  the  data 
set  by  using  a  so-called  branch  and  bound  algorithm  [23].  This  algorithm  is 
able  to  find  the  optimal  splitting  of  the  data  set.  However,  the  computation 
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requirement  is  great  when  the  number  of  samples  is  large.  Moreover,  if  J * 
denotes  a  criterion  function  computed  using  k  samples  from  the  data  set,  only 
criterion  functions  for  which  J\  <  J2  <  ■  •  •  <  Jn  can  be  used.  Thus  tr( W)  may 
be  used,  while  min{||j*,  —  /*||}  is  not  applicable.  Therefore  we  have  to  use  an 
algorithm  based  on  the  suboptimal  evolutionary  search  principle  rather  than  the 
branch  and  bound  algorithm. 

The  evolutionary  search  procedure  uses  the  result  from  4.2.1  as  an  initial 
clustering.  Then  each  sample  is  considered  as  a  candidate  for  reallocation.  If 
reallocation  results  in  an  improved  value  of  the  criterion  function,  the  samples  is 
transferred  to  the  other  cluster.  Otherwise  it  remains  in  its  original  cluster.  One 
iteration  is  finished  when  all  samples  have  been  considered.  The  algorithm 
terminates  if  no  samples  are  transferred  during  an  iteration.  The  principle  is 
illustrated  by  the  following  statements: 

<Generate  an  initial  clustering  according  to  4.2.1  >; 

<Compute  the  initial  value  of  the  criterion  function>; 

REPEAT 

finished:=TRUE; 

FOR  i:=l  to  N  DO 
BEGIN 

< Change  the  assignment  of  sample  x,  >; 

<Compute  the  value  of  the  criterion  function>; 

IF  <Improved  value  of  the  criterion  function  >  THEN 
BEGIN 

finished:=FALSE; 

<Update  the  optimal  value  of  the  criterion  function>; 

END 

ELSE  <Change  the  assignment  of  sample  x,  >; 

END; 

UNTIL  (finished); 
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4.2.3  Evaluation  of  the  different  algorithms 

We  have  made  simulations  using  samples  taken  from  two,  three  and  five 
dimensional  distributions.  For  each  distribution  and  dimension,  we  have  made 
simulations  using  three  different  sample  sizes  (N  =  25,  N  =  75  and  N  =  150). 
We  have  used  distributions  containing  outliers,  “sparse”  distributions  and 
non-symmetrical  distributions  as  well  as  more  “compact”  distributions.  Briefly 
speaking,  a  distribution  is  considered  as  “sparse”  if  the  majority  of  the  samples 
in  a  data  set  drawn  independently  from  the  given  distribution  are  located  near 
the  mean,  and  the  rest  of  the  samples  are  located  both  far  away  each  other  and 
the  mean.  Contrary,  in  a  “  compact”  distribution  all  samples  are  located  near 
its  mean. 

We  define 

Si=I, 


E2  =  diag(4, 1)  = 


4  0 
0  1 


and 


E3  =  diag(l,  16). 


The  following  two-dimensional  distributions  have  been  used: 


-Two  Gaussian  distributions  with  covariance  matrices  Ej  and  S2 
respectively 

-Two  Gaussian  distributions  contaminated  with  an  “outlier  distribution”. 
We  have  chosen  a  distribution  having  the  density 

p(x)  =  (1  —  p)N(n  ,  E2)  +  pN{n  ,  E3),  and  simulations  are  carried  out  for 
p  =  0.10  and  p  =  0.25. 
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-  Two  double  exponential  distributions  with  covariance  matrices  Ej  and  E2. 

-  Two  negative  exponential  distributions  with  covariance  matrices  Ej  and 

S2. 

-  Rectangular  distribution  i?(0  :  1,0  :  1). 

-  Triangular  distribution  T{ 0  :  1,0  :  1). 


The  triangular  distribution  used  has  the  density  function 


P(x) 


2  , 0  <  xi  <  1  and  0  <  X2  <  Xi 
0  ,  Otherwise. 


Furthermore,  we  define 


xu  =  i, 

Es  =  diag(4, 1, 1), 


E6  =  diag(l,  16, 1), 


and  the  following  three-dimensional  distributions  are  used: 


-Two  different  Gaussian  distributions  having  covariance  matrices  E4  and 
E5  respectively 

-  Two  different  Gaussian  distributions  contaminated  with  outliers  according 
two  the  distribution  with  the  density 

p(x)  =  (1  —  p)N(fi  ,  E5)  +  pN(fi  ,  E6)  where  p  =  0.10  and  p  =  0.25 

-Two  different  double  exponential  distributions  with  covariance  matrices 
E4  and  E5 
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-Two  different  negative  exponential  distributions  with  covariance  matrices 
£4  and  £5 


Finally,  we  define 

S7=I, 


£8  =  diag(4, 1, 1, 1, 1), 


£9  =  diag(4,4,4, 1, 1), 


£10  =  diag(4,4, 1, 1, 1), 


£n  =  diag(l,  1,16, 1,1), 


£12  =  diag(l,  1, 16, 16, 16), 


and  simulations  have  been  carried  out  for  the  following  five-dimensional 
distributions: 


-  Three  different  Gaussian  distributions  with  covariance  matrices  £7,  £g 
and  £9. 

-  Four  different  Gaussian  distributions  contaminated  by  outliers,  with  the 
densities  pi(x)  =  (1  —  p)N(fi  , £g)  +  pN(n  , £u)  and 

p2(x)  =  (1  —  p)N(n  ,  £10)  +  pN(/i  ,  £12)  where  p  =  0.10  and  p  =  0.25. 

-  Three  different  double  exponential  distributions  having  covariance 
matrices  £7,  £g  and  £9. 
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-  Three  different  negative  exponential  distributions  having  covariance 
matrices  Sr,  S8  and  £9. 


The  following  three  criterions  are  used  for  the  evaluation: 


a)  min{||/*,  -/i||} 

b)  tr(W) 

c)  |W| 


The  motivation  for  using  these  criteria  is  found  in  4.2.2.  Furthermore,  the 
criterion 

minfJV!,  N2} 

N 

is  used  for  evaluating  if  there  is  an  unbalanced  number  of  samples  in  the 
subclasses. 

In  the  tests  involving  sample  sizes  of  N  =  25  and  N  =  75,  1000  replications  are 
used.  This  is  usually  sufficient  for  obtaining  reliable  statistics.  However,  only 
500  replications  are  used  for  N  —  150,  since  in  this  case  1000  replications  would 
require  around  60  CPU-hours  on  our  ND-570  computer  in  order  to  simulate  one 
distribution.  Thus,  2.6  CFU-months  is  needed  for  simulating  all  distributions. 
Even  with  the  reduced  number  of  replications,  the  CPU-requirement  is  great, 
but  further  reduction  is  not  justified.  The  reason  for  this  great  need  of 
CPU-time,  is  to  be  found  in  the  sorting  required  by  the  outlier  detector. 


The  amount  of  data  is  so  large  that  it  is  impossible  to  present  all  results  here. 
Instead  general  remarks  illustrated  with  some  results  are  given. 
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4.2.3. 1  Results  from  the  initial  splitting 

Let  us  first  present  some  of  the  simulation  results.  In  table  4.5  the  results  from 
the  three-dimensional  double  exponential  distribution  (S  =  I  and  N  =  75)  are 
presented.  The  results  from  the  five  dimensional  distribution 
p(x)  =  0.9JV(/i  ,  S8)  +  0.1  iV(/x  ,  S12)  are  shown  in  table  4.6.  Both  these 
distributions  are  fairly  difficult  to  split,  and  hence  the  performance  of  different 
strategies  is  well  illustrated.  The  tables  show  the  results  using  the  five  best 
splitting  methods  with  and  without  outlier  detection  in  order  to  illustrate  the 
difference  in  performance  when  using  outlier  detection.  Only  the  results  using 
the  eigenvector  based  approach  for  determining  the  starting  samples  are  shown 
because  it  is  much  more  robust  than  the  approach  given  by  eq  4.8. 

We  first  note  that  the  performance  of  the  algorithms  is  more  or  less  independent 
of  both  the  number  of  samples  and  the  feature  space  dimension.  This  was 
expected  since  there  is  no  mathematical  indication  that  the  splitting  algorithms 
depend  on  the  feature  space  dimension  or  the  number  of  samples  in  the  data 
set.  The  ranking  of  the  various  splitting  strategies  is  also  almost  independent  of 
the  sample  distribution.  Since  simulations  have  been  carried  out  using  a  wide 
variety  of  distributions,  it  is  reasonable  to  assume  these  results  to  have  a  quite 
general  validity. 

Not  surprisingly  the  best  results  are  obtained  by  using  the  eigenvector  based 
splitting  (E-V-M)  from  4. 2. 1.1.  We  know  that  in  the  Gaussian  case  the 
distance  between  the  means  of  the  parent  cluster  and  the  nearest  (new)  cluster 
is  maximized  when  projected  into  the  direction  of  maximum  variance  of  the 
parent  cluster.  As  exptected,  the  trace  of  the  scatter  matrix  is  also  substantially 
reduced.  Moreover,  the  strategy  does  not  depend  on  properly  chosen  starting 
samples.  The  simulation  does  also  show  that  it  is  very  robust  with  respect  to 
outliers. 

The  algorithms  outlined  in  4. 2. 1.2  are  much  less  robust  with  respect  to  outliers. 
Therefore,  the  eigenvector  approach  for  the  determination  of  the  starting 
samples  should  be  used.  Furthermore,  the  results  show  the  performance  of  the 
splitting  criterions  from  4. 2. 1.2  to  be  in  the  following  order: 
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1)  Nearest  starting  sample,  N-S-S  (/j) 

2)  Furthest  sample,  F-S  (/3) 

3)  Group  average  distance,  G-A-D  ( f6 ) 

4)  Mean  (/4) 

5)  Median  (/5) 

6)  Nearest  sample  (/2) 


These  results  do  correspond  to  the  findings  of  Bayne  et  al  [2]  and  Jain  et  al  [20]. 
Bayne  et  al  used  Monte  Carlo  simulations  to  estimate  the  percent 
misclassification  of  13  clustering  methods  for  six  types  of  parameterizations  of 
two  bivariate  normal  populations.  The  methods  were  compaxed  using 
probability  of  misclassification  and  incidence  matrices.  Jain  et  al  compaxed  six 
different  hierarchical  methods  on  univariate  random  data  (all  samples  were 
taken  from  the  same  distribution)  with  respect  to  their  tendencies  to  discover 
false  clusters. 

We  also  found  that  outliers  and  sparsely  distributed  samples  heavily  affected 
the  splitting.  If  one  of  these  conditions  occur,  the  number  of  samples  in  the 
clusters  becomes  very  skewly  distributed.  Hence  the  min{||/x,  —  /x || }  criterion 
signals  poor  splittings.  Examples  illustrating  these  findings  are  shown  in 
tables  4.5  and  4.6. 


Crit  func  Sta  sam  Out  det 


E-V-M 


IES3GSR9II 


mindly 


N-S-S  (4.25) 


E-V-M 


N-S-S  (4.25)  N 


(4.25)  N 


l 

0.129 

F-S 

(4.25) 

Y 

0.429 

0.052 

0.646 

0.122 

G-A-D 

(4.25) 

N 

0.153 

0.314 

2.331  0.456 

0.368  0.224 

2.347  0.401 

0.373  0.193 

2.407  0.509 

0.392  0.259 

2.356  0.456 

0.370  0.223 

2.459  0.542 

0.402  0.272 


0.404  0.278 


Mean 

(4.25) 

Y 

mm 

0.570 

2.380 

0.448 

■n 

0.162 

0.376 

0.222 

Table  4.5:  Initial  splitting  results  using  75  samples  taken  from  a  three  dimensional 
double  exponential  distribution  (E  =  I).  Both  mean  and  standard 
deviation  of  each  criterion  are  tabulated  for  each  splitting  function. 
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Crit  func 

Sta  sam 

Out  det 

R32S5K3B 

min{  ||/4,  -  n\\ } 

*|W| 

E-V-M 

- 

N 

MEM 

1.586 

11.897 

68.142 

BB 

3.528 

146.473 

E-V-M 

- 

Y 

1 

11.903 

42.490 

1 

3.522 

79.348 

N-S-S 

(4.25) 

N 

bob 

BH 

11.520 

50.508 

Hi 

B 

3.088 

133.274 

N-S-S 

(4.25) 

Y 

B 

■m 

11.999 

53.054 

BB 

3.491 

105.082 

F-S 

(4.25) 

N 

■a 

0.996 

11.975 

60.208 

BB 

0.420 

3.265 

133.546 

F-S 

(4.25) 

Y 

BBS 

BH 

12.398 

63.225 

bb 

3.732 

124.674 

G-A-D 

(4.25) 

N 

mm 

0.852 

11.336 

38.534 

BB 

0.480 

2.873 

74.997 

G-A-D 

(4.25) 

Y 

0.366 

.  -:V 

11.979 

44.978 

0.094 

3.497 

84.484 

Mean 

(4.25) 

N 

JUKI 

11.336 

fl 

2.825 

Mean 

(4.25) 

Y 

mm 

12.018 

47.999 

BB 

3.497 

100.798 

Table  4.6:  Initial  splitting  results  using  25  samples  taken  from  a  five  dimensional 
Gaussian  distribution  contaminated  with  10%  outliers.  Both  mean  and 
standard  deviation  of  each  criterion  are  tabulated  for  each  splitting 
function. 
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4.2.3.2  Results  from  the  optimization 

Tables  4.7-4.12  show  examples  of  simulation  results  obtained  by  using  the 
various  optimization  criteria.  In  tables  4. 7-4. 9  the  results  using  the  three 
dimensional  double  exponential  distribution  is  shown.  In  table  4.7,  the 
min{||/*,-  —  /i||}  criterion  is  used,  tr(W)  is  used  in  table  4.8,  and  finally  the  |W| 
criterion  is  used  in  table  4.9.  In  the  same  way,  the  results  using  the  five 
dimensional  Gaussian  distribution  contaminated  with  10%  outliers  are  shown  in 
tables  4.10-4.12. 

Simulations  are  made  for  all  three  optimization  criterions  tr(W),  |W|  and 
min{^,-  —  /i},  and  for  all  combinations  of  sample  sizes,  dimensionalities  and 
distributions  described  previously  in  this  section.  Each  criterion  is  tested  using 
various  initial  splittings.  The  purpose  has  been  to  study: 


-  Which  of  the  three  criterions  have  the  best  overall  performance, 

-  if  optimization  can  compensate  a  bad  initial  splitting, 

-  if  optimization  significantly  improves  good  initial  splittings  (splittings 
produced  by  the  eigenvector  based  method). 


For  each  optimization  criterion,  the  following  six  different  initial  splitting 
strategies  are  tested. 

-  Eigenvector  method  without  outlier  detection 

-  Furthest  sample  method  with  and  without  outlier  detection 

-  Group  average  method  with  and  without  outlier  detection 

-  Mean  method  without  outlier  detection 


The  eigenvector  method  is  only  tested  without  using  the  outlier  detector 
because  its  performance  is  almost  unaffected  by  outliers.  The  mean  method  is 
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also  used  without  the  outlier  detector  due  to  its  very  poor  initial  splittings.  The 
two  other  criteria  were  tested  both  with  and  without  the  outlier  detector. 

Also  here,  we  observe  that  the  performance  of  the  algorithms  is  almost 
independent  of  the  number  of  samples  used  in  the  data  set  as  well  as  the  sample 
space  dimension. 

Moreover,  if  the  data  set  contains  outliers  and/or  the  samples  are  sparsely 
distributed,  tr(W)  and  |W|  often  give  quite  unbalanced  subclasses  with  respect 
to  the  number  of  samples.  The  high  standard  deviation  of  the  — 
criterion  when  using  tr(W)  or  |W|  also  indicates  that  the  splitting  performance 
is  very  sensitive  to  the  actual  data  set.  Thus  the  reliability  may  be  poor  when 
using  tr(W)  or  |W|  as  optimization  criterions.  These  effects  are  avoided  by 
using  the  min{||/x,-  —  jt||}  criterion,  which  seems  to  have  the  best  overall 
performance. 

Furthermore,  the  criterion  based  on  first  order  statistics  is  found  to  be  little 
affected  by  the  quality  of  the  initial  splitting,  and  the  evaluation  criteria 
indicate  good  splittings  in  all  simulations.  The  criteria  based  on  the  scatter 
matrix  on  the  other  hand  seems  to  be  very  sensitive  to  the  initial  splitting. 
These  optimization  procedures  are  performing  best  when  using  the  best  initial 
splitting,  and  the  worst  initial  splitting  strategies  produce  very  poor 
“optimized”  performance.  Thus,  the  min{jj/x,  —  ft\\}  criterion  is  the  only 
(evaluated)  criterion  which  is  able  to  improve  a  bad  initial  splitting  significantly. 

Finally,  we  found  that  the  difference  between  the  initial  splitting  based  on  the 
eigenvector  and  the  optimized  splitting  is  only  marginal.  This  should  indicate 
that  the  eigenvector  method  produces  good  splittings,  and  thus  the 
optimization  procedure  is  hardly  needed  in  a  practical  situation. 

From  the  discussion  and  the  illustrations,  it  should  be  clear  that  the  eigenvector 
method  is  able  to  split  a  data  set  satisfactorily.  Furthermore,  if  optimization  of 
the  splitting  is  required,  the  min{||/i,  —  /i|j}  criterion  should  be  used. 
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Ini  spl 

Out  det 

minJUMi  -  /x||} 

HE! 

E-V-M 

N 

' 

0.826 

ESI 

0.069 

EH 

F-S 

Y 

0.493 

0.822 

2.344 

0.001 

0.070 

0.368 

HI 

G-A-D 

HQHj 

BB 

0.824 

2.340 

EH 

BH 

0.069 

0.366 

EH 

F-S 

N 

mm 

0.819 

2.349 

E&a 

0.371 

EH 

G-A-D 

N 

wmm 

0.821 

2.345 

m 

II 

0.071 

0.368 

EH 

Mean 

N 

Bl 

2.346 

0.436 

■Hi 

0.372 

0.211 

Table  4.7:  Optimized  splitting  using  the  optimization  criterion  based  on  first  or¬ 
der  statistics.  The  data  set  contains  75  samples  taken  from  a  three 
dimensional  double  exponential  distribution  (£  =  I).  Both  mean  and 
standard  deviation  are  tabulated. 


Ini  spl 

Out  det 

ndn{||M<-M  } 

WMM. 

*|W| 

E-V-M 

N 

0.696 

■ 

eh 

0.127 

i 

EH 

F-S 

■ 

wma 

0.689 

2.298 

0.394 

| 

mm 

0.122 

0.362 

0.189 

G-A-D 

Y 

0.374 

0.668 

2.299 

EH 

0.087 

0.132 

0.359 

EH 

F-S 

N 

0.602 

2.311 

0.419 

0.189 

0.369 

0.212 

G-A-D 

N 

0.453 

2.369 

EH 

■BH 

0.250 

0.399 

EH 

Mean 

N 

0.220 

0.431 

2.385 

0.477 

0.161 

0.255 

0.404 

0.261 

Table  4.8:  Optimized  splitting  using  the  tr(W)  criterion.  The  data  set  contains 
75  samples  taken  from  a  three  dimensional  double  exponential  distri¬ 
bution  (E  =  I). 


Table  4.9:  Optimized  splitting  using  the  |W|  criterion.  The  data  set  contains  75 
samples  taken  from  a  three  dimensional  double  exponential  distribu¬ 
tion  (S  =  I). 


min{||/i,  -/x|i}  I  itr(W) 


11.946  58.387 

3.458  130.897 


11.971  49.916 


95.886 


47.297 

81.275 


54.085 

106.906 


53.072 


3.472  107.789 


12.017  52.881 

3.470  107.678 


Table  4.10:  Optimized  splitting  using  the  optimization  criterion  based  on  first 
order  statistics.  The  data  set  contains  25  samples  taken  from  a  five 
dimensional  Gaussian  distribution  contaminated  with  10%  outliers. 


Ini  spl 

Out  det 

E-V-M 

N 

Mean 


0.180 

0.159 


Table  4.11:  Optimized  splitting  using  the  tr(W)  criterion.  The  data  set  contains 
of  25  samples  taken  from  a  five  dimensional  Gaussian  distribution 
contaminated  with  10%  outliers. 


Ini  spl 


E-V-M 


Out  det 


N 


nsKsai 


N 


.397 

.088 


0.406 

0.071 


0.390 

0.080 


0.286 

0.164 


.226 

.163 


.219 

.162 


™n{||/if-#*||} 


1.427 

0.314 


1.407 

0.303 


1.402 

0.312 


1.083 

0.451 


0.950 

0.481 


0.920 

0.481 


12.137 

3.447 


12.404  32.264 
3.568  62.226 


31. 


54. 


11.879  30.884 
3.166  60.341 


11.664  31.273 
3.094  57.670 


11.728  31.486 
3.098  57.729 


Table  4.12:  Optimized  splitting  using  the  |W|  criterion.  The  data  set  contains 
of  25  samples  taken  from  a  five  dimensional  Gaussian  distribution 
contaminated  with  10%  outliers. 
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4.3  Generating  piecewise  linear  classifiers 

Earlier  we  defined  a  piecewise  linear  classifier  as  a  classifier  where  the 
hypersurface  separating  two  classes  is  piecewise  linear.  In  this  section  we  will 
see  how  the  subclass  information  may  be  used  for  generating  a  piecewise  linear 
classifier.  Each  subclass  is  represented  by  a  data  set  containing  all  samples  x  for 
which  x  £  S{]  and  belongs  to  u>,-.  We  will  use  n;  weight  vectors  for  representing 
u>i  (one  weight  vector  for  each  subclass).  The  discriminant  function  will  be 
defined  as 


5»(x)  =  max{a‘7y} 


j 

y* 


1,2,  •••,«, 


(4.10) 


where  a,j  is  the  weight  vector  of  the  j’th  subclass  in  cj,.  Now,  x  is  assigned  to 
the  class  for  which  fl'.(x)  is  maximum.  It  may  be  shown  that  4.10  is  a  piecewise 
linear  classifier  [5]. 

In  the  next  four  subsections  we  will  study  various  approaches  for  generating 
piecewise  linear  classifiers  based  on  the  subclass  information. 

4.3.1  The  nearest  submean  classifier 

This  classifier  assigns  a  sample  x  to  the  class  with  the  nearest  subclass.  Let  the 
subclasses  o^,  a2,  •  •  *,  an  be  defined  according  to  4.4.  Moreover,  denotes  the 
mean  (in  feature  space)  of  a,.  Then  x  is  assigned  to  u;(afc)  if 


l|x  —  Mjtll  =  min{||x-/x,||}  ,  i  =  1,2 ,•••,« 

t 


(4.11) 


where  ce(afc)  returns  the  class  of  o^.  It  is  easy  to  verify  that  the  nearest 
submean  classifier  (NSMC)  is  piecewise  linear  and  may  be  defined  in  accordance 
with  4.10  [5].  The  weight  vectors  are  then  defined  as 


1  t  , 

^  =  | “2^^  ’ 


67 


when  using  an  augmented  feature  vector  representation.  Initially,  we  did  not 
expect  this  classifier  to  be  able  to  adapt  itself  sufficiently  to  the  data  set.  It  is 
also  affected  by  scaling  the  axes.  Therefore  other  strategies  have  been 
developed. 

4.3.2  Piecewise  linear  classifiers  based  on  the  mean  squared  approach 

A  more  advanced  approach  is  to  generate  a  piecewise  linear  classifier  which  is 
based  on  mean  squared  error  (MSE)  techniques.  Also  in  this  case  we  are 
defining  one  weight  vector  for  each  subclass,  and  it  is  quite  easy  to  generate  a 
classifier  using  a  standard  MSE  algorithm  [39].  First  the  n  subclasses  are 
defined  according  to  4.4.  Then  the  weight  matrix 

A  =  Jaj  :  a2  :  a3  :  •  •  •  : 

is  chosen  so  that  the  cost  function 

J  =  tr  {(YA  -  B)‘  (YA  -  B)}  (4.12) 

is  minimized.  The  matrix  Y  =  [yx  :  y2  :  •  •  •  :  yyvf,  dim{Y}  =  N  x  d,  contains 
the  samples  in  the  data  set.  iV,-  samples  represent  a,,  and  totally  there  are 
N  =  N\  -f  JVj  +  •  •  •  -f  Nn  samples.  Moreover,  the  samples  from  the  same 
subclass  are  assumed  to  be  contiguously  stored  in  Y.  That  is;  the  first  N\ 
samples  belongs  to  a},  the  next  N2  samples  to  a2,  and  so  on.  B  is  a  cost  matrix 
which  is  being  considered  later.  The  solution  minimizing  4.12  is  given  as  [39] 

A  =  Y+B 

=  (Y(y)-1Y‘B  (4.13) 

where  “+”  denotes  the  pseudo  inverse.  The  cost  matrix  B  may  be  interpreted 
in  two  different  ways  [39].  First  it  may  be  associated  with  the  cost  function  used 
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for  calculating  Bayes  risk.  Secondly  it  may  be  interpreted  as  a  set  of  vertices  in 
an  Euclidean  space,  to  which  the  samples  in  Y  are  mapped.  However,  in  this 
thesis  we  will  only  consider  the  latter  interpretation. 

Let 


Now,  the  samples  belonging  to  a i  are  attempted  mapped  into  the  point  A j, 
using  the  weight  vector  a and  the  weight  vectors  in  A  are  chosen  so  as  to 
minimize  the  squared  sum  of  mapping  the  samples  Y  into  the  vertices  given  in 
B.  The  problem  is  how  to  determine  the  elements  in  B.  Several  methods  have 
been  suggested  [39].  One  simple  possibility  is  to  select  the  following  cost: 


f 


1  ,  i  =  j 
0  ,  otherwise, 


(4.14) 
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which  means  that  the  samples  representing  o-j  are  tried  mapped  into  the  point  1 
and  the  rest  of  the  samples  into  the  point  0. 

We  have  also  tried  to  use  the  distance  between  the  means  of  the  subclasses  in 
order  to  make  a  more  reliable  classifier.  One  simple  approach  is  to  let  the 
samples  be  mapped  into  points  proporsional  to  the  distance  between  the 
subclasses,  i.e.  we  will  propose  the  following  risk  function: 


—  ||m<  —  Mill  ,  w(ai)  #  u{aj) 
,  otherwise. 


(4.15) 


Other  and  more  complex  ways  of  using  the  distance  information  for  generating 
the  coefficients  AtJ  have  also  been  studied.  However,  in  all  tests  the 
MS  El-classifier  using  4.14  for  generating  A  tJ  have  been  superior  to  the  distance 
based  ones.  Thus,  these  methods  are  not  considered  in  the  following. 

4.3.3  Determination  of  a  piecewise  linear  classifier  using  a  weighed 
MSE-approach 

When  using  the  mean  squared  error  approach  (MSE),  one  tries  to  map  all 
samples  onto  a  set  of  points  defined  by  elements  in  the  “risk  matrix”  B  for  a 
given  subclass  a;.  We  want  to  improve  the  MSE-classifier.  Then,  it  might  be 
useful  to  reduce  the  number  of  samples  involved  in  the  determination  of  each 
weight  vector,  or  generally  to  weigh  the  samples  from  the  different  subclasses. 

First,  it  is  no  reason  to  use  the  samples  representing  a,-,  j  ^  i,  a;(aj)  =  u>(ai ) 
when  determining  a,  (the  weight  vector  representing  a,).  Secondly,  we  will  also 
weight  the  contribution  to  the  mean  squared  error  from  the  various  subclasses 
differently.  This  is  done  in  order  to  make  each  weight  vector  sensitive  to 
selected  (sub)classes  only. 

In  the  following,  we  will  use  A,  B  and  Y  as  defined  in  4.3.2.  Moreover,  we  will 
define  Z  as 


Z  =  (YA  -  B)*(YA  -  B)  . 


In  the  MSE-approach  described  previously,  we  wanted  to  minimize 
J  =  tr{Z  }  =  '£Zkk.  (4.16) 

k=l 

It  is  easy  to  see  that 


min  {tr{Z}}  =  min {Zkk}  .  (4.17) 

fe=i 

Minimizing  4.16  is  thus  the  same  as  minimizing  each  diagonal  element  in  Z 
separatly.  Therefore,  in  the  following  we  will  study  the  determination  of  the 
fc’th  weight  vector  only.  Using  the  standard  MS  El-procedure,  this  corresponds 
to  determine  the  weight  vector  a*  minimizing 


2 


n 


Zkk  =  52 
j= i 


£ 


y!«k 


(4.18) 


where  y j  is  the  /’th  sample  vector  which  is  stored  in  row  no.  I  in  the  sample 
matrix  Y.  Furthermore,  a(y/)  returns  the  subclass  which  y /  belongs  to.  For 
details  about  the  vector  A*,  see  section  4.3.2.  Moreover, 


a  k 


is  chosen  in  order  to  minimize  the  squared  sum. 


By  not  including  the  samples  from  a;,  u(atk)  =  oj(q}),  we  are  reducing  the 
number  of  samples  which  is  mapped  into  the  point  0.  Thus,  we  achieve  the  cost 
function 
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E 

j~k 

j  :  u(ctj)  ^  u>(afc) 


E  [y‘a*  -  v]' 

yi)=o,j 


(4.19) 


This  cost  function  may  be  generalized  by  weighting  the  contribution  to  the 
squared  error  from  each  subclass.  Then  we  obtain  the  following  cost  function 


Jk=J2  Pi*  H  [y'a*  -  vj 2  >  Pjk  >  o 


(4.20) 


«=1  l:a(yt)=a] 


We  can  easily  see  that  by  selecting  =  1  V  j  we  obtain  4.18  and  by  defining 


1  ,  j  =  kor  u(atj)  ±  w(ajfc) 

0  ,  otherwise 


we  will  obtain  4.19. 

We  wanted  to  test  whether  it  was  desirable  to  weigh  the  squared  error 
contributions  from  the  nearest  subclasses  more  than  the  most  distant  ones.  This 
may  be  obtained  by  letting  /3jk  be  inversely  proporsional  to  djk,  and  therefore 
we  have 


0]k  <  @kk 


(4.21) 


which  is  also  in  accordance  with  4.18  and  4.19.  The  actual  choice  of  / 3jk  is  hardly 
important.  Intuitively,  one  possibility  is  to  let  /?**  =  1,  and  then  in  accordance 
with  4.21  require  0  <  /?>*  <  1.  This  requirement  can  be  fulfilled  by  defining 


1  ,  j  =  k 

d‘  \p  ,  , 


Pik  =  <  (jrj)  ,  w(«i)  ^  <*>(<**) 

0  ,  otherwise 


(4.22) 
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where  p  >  0  and  d'k  is  the  distance  from  the  mean  of  ak  to  the  mean  of  the 
nearest  subclass  representing  another  class.  We  see  that  p  =  0  results  in  4.19. 
Furthermore,  when  p  is  “great”,  then 

0  ,k  _  |  1  »  3  =  k  or  d'k  =  d:k 

(  «  0  ,  otherwise  . 

We  will  now  derive  the  weight  vectors  minimizing  4.20.  The  gradient  of  the  cost 
function  is 


V  Jk  =  2 


E  ft*  E  (aly/  ~  xjk)  y / 


^a(yi)=«j 


(4.23) 


Now,  the  weight  vector  minimizing  4.20,  is  the  one  for  which  VJfc  =  0.  We 
obtain 


£/%*  E  y/yJ 

,J=1  ha(yt)=a, 


afc 


Y' 


=  E&*A**  E  y»  • 

j=l  l:a(y,)=a} 

> - - - ' 

y' 


(4.24) 


Thus  4.24  implies 

afc  =  r’y'  (4.25) 


which  is  the  solution  to  our  minimization  problem. 

4.3.4  Generating  a  piecewise  linear  classifier  from  a  set  of  hyperplanes 

As  mentioned,  in  this  approach  we  first  define  hyperplanes  separating  some  of 
the  (sub)classes.  Then  the  hyperplanes  are  used  for  generating  a  piecewise 
linear  classifier.  Three  problems  have  to  be  solved: 
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a)  How  to  select  the  pairs  of  (sub)classes  to  be  separated  by  a  hyperplane, 

b)  How  to  generate  a  linear  classifier, 

c)  How  to  use  the  linear  classifiers  for  generating  the  piecewise  linear 
classifier. 


4.3.4. 1  Determining  the  pairs  of  subclasses  to  be  separated 


Let  us  first  assume  the  density  of  each  class  and  the  mean  og  each  subclass  to  be 
known.  Let  us  also  for  a  while  assume  p(xju;(at))  to  be  monotonic  along  a  line 
going  from  the  mean,  pt,,  of  the  (sub)class  (we  are  only  using  this  assumption  for 
illustration  purposes). Now,  two  (sub)classes,  say  a,-  and  ay,  are  assumed  to  need 
separation  if  the  optimal  hypersurface  is  intersecting  the  line  £,y  going  from  fx{ 
to  fij  fx-  is  assumed).  This  proposal  may  be  illustrated  with  the  following 

statements  in  “quasi-Pascal”.  We  are  also  here  assuming  n  subclasses,  that  u 
is  split  into  n,  subclasses.  Hence  n  —  nc  is  the  last  subclass  number  belonging  to 
u>c-i-  Moreover,  let  f(i)  denote  the  last  subclass(number)  belonging  to  u/(a,). 

FOR  i  :=  1  TO  n  —  nc  DO 
BEGIN 

FOR  j  :=  /(*)  +  1  TO  n  DO 

BEGIN 

cCompute  X,-;  of  /y  (if  it  exists)  for  which 

P(u(ai))p(Xij\u(oi))  =  P{ui(ctj))p(Xij\ui(ctj))  >; 

IF  <  Xy  exists>  THEN  b:=TRUE  ELSE  b:=FALSE; 

k  :=  1; 

WHILE  {(k  <  n)  AND  (b))  DO 

BEGIN 

IF  ((Jfc  <>  «)  AND  (Jfc  <>  ;))  THEN 
BEGIN 

IF  (P(w(a*))p(xy|w(a*))  >  P(a/(ay))p(x,y|w(ay)))  THEN 
b:=FALSE; 

END; 
k  :=  k  +  1; 

END; 

IF  (b)  THEN  cCompute  a  hyperplane  separating  a,-  and  a;-  >; 

END; 

END; 


Estimation  of  the  “density”  function  of  the  subclass  may  lead  to  poor 
performance  because  in  many  cases  only  a  few  samples  are  contained  in  parts  of 
the  sample  space. 


Alternatively,  the  concept  of  the  Mahalanobis  distance  may  be  used.  This 
distance  is  defined  as 


D-£(x i,xj)  =  (xi  -  x2)‘  E'1  (Xj  -  x2) 


where  E  is  a  positive  definite  matrix.  The  decision  algorithm  given  previously 
has  to  be  slightly  modified.  First  we  determine  the  point  x,j  on  the  line  for 

r 

which 


where  E,  denotes  the  covariance  matrix  of  a,-.  Then,  a  hyperplane  should  be 
generated  if 


max  {Z?Sk(xi;j,  nk)}  >  £s,(xt-j,  **,-)  ,  k  #  i,j  . 

k 

It  should  be  noticed  that  no  use  of  a  priori  information  is  shown  above. 
However,  this  information  is  used  ad  hoc  by  simply  weighing  the  Mahalanobis 
distance  with  the  a  priori  probability.  If  the  samples  of  the  subclasses  are  drawn 
from  Gaussian  distributions,  then  using  the  Mahalanobis  distance  in  the 
decision  equals  the  previous  strategy.  Equal  a  priori  probability  for  all  classes  is 
then  assumed. 
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The  Mahalanobis  distance  strategy  does  not  detect  that  subclasses  often  have  a 
rather  bounded  sample  space.  Therefore,  non-interesting  subclasses  with  large 
variances  may  cause  trouble  even  when  they  are  located  far  away  from  the 
interesting  subclasses.  This  case  is  illustrated  in  figure  4.6  where  the  problem  of 
deciding  if  a  hyperplane  should  be  generated  for  separating  the  subclasses  q2 
and  a4  is  shown. 


Figure  4.6:  An  illustration  of  a  situation  where  the  Mahalanobis  distance  strategy 
fails.  The  resulting  classifier  is  based  on  discriminating  ct\  and  a4  only 
(as  illustrated  with  the  solid  line).  However,  a  clearly  more  reliable 
classifier  is  obtained  if  it  is  based  on  discriminating  both  a2  and  a4 
as  well  as  a3  and  a4  (as  shown  with  the  dashed  lines). 

We  see  that  D'£l(x24,  is  less  than  f?Ej(x24,  j*2).  (-^Ei (x34» Mi)  is  also  less 
than  D-£3(x 34,  Ms)-)  Hence  a  very  poor  decision  about  which  subclasses  to  be 
separated  will  be  made,  and  the  resulting  classifier  will  most  likely  provide  a  low 
performance.  If  we  instead  decide  to  combine  two  hyperplanes;  one  separating 
ar2  and  a4 ,  and  one  separating  q3  and  a4,  the  resulting  classifier  may  be  clearly 
more  reliable,  as  illustrated  in  the  figure.  Thus,  we  will  now  propose  another 
approach  which  has  been  found  to  work  better  than  the  Mahalanobis  distance 
in  many  examples. 
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This  alternative  approach  to  the  Mahalanobis  distance  is  based  on  the  projected 
(empirical)  cummulative  distribution.  Let  cq  denote  the  standard  deviation  of 
the  samples  projected  into  the  vector  b  (in  other  words:  <7,  =  b‘  £  b).  Assume 
for  a  while  the  samples  within  a  subclass  to  be  Gaussian  distributed.  The 
projected  cummulative  distribution  ( y  =  b*x)  is  given  as 


rb'x 

F(  b‘x)  =  / 

J— OO 


dz  . 


(4.26) 


Since  the  Mahalanobis  distance  also  may  be  written  as 

2 


*<«■»>  -  ■ 


(4.27) 


one  may  see  that  there  is  a  connection  between  the  Mahalanobis  distance  and 
the  projected  cummulative  distribution.  Hence  it  may  be  used  in  the  same  way 
as  the  Mahalanobis  distance.  However,  using  the  projected  cummulative 
distribution  provides  a  different  distance  measure.  Intuitively,  it  is  more  closely 
related  to  the  actual  distribution  than  the  Mahalanobis  distance.  Thus,  it  may 
manage  to  handle  “difficult”  situations  (e.g.  situations  where  the  variance(s)  of 
the  subclasses  differs  much  from  subclass  to  subclass)  satisfactorily.  Contrary  to 
the  Mahalanobis  distance,  it  does  also  detect  bounded  sample  spaces.  Hence,  it 
is  more  or  less  unaffected  by  situations  similar  to  figure  4.6.  However,  one  may 
expect  the  projected  cummulative  distribution  method  to  produce  somewhat 
unreliable  decisions  when  only  a  small  number  of  samples  are  representing  a 
subclass.  Thus,  a  relatively  large  design  set  is  necessary  if  a  large  number  of 
splittings  is  required. 


The  decision  algorithm  becomes  very  much  similar  to  the  Mahalanobis  distance 
based  one.  Let  us  also  here  assume  that  a;  and  otj  are  to  be  tested.  The  point 
Xjj  is  now  defined  as  the  point  of  equal  projected  cummulative  probability 
(projected  into  a  vector  parallel  to  the  direction  of  tt]),  and  it  is  to  be 
determined  first.  Next,  for  each  ak,  k  ^  i,j,  we  examine  the  projected 
cummulative  probability  (projected  into  a  vector  parallel  to  the  line  going 
between  x*,  and  pk)  in  order  to  make  the  decision  if  a  hyperplane  is  to 
discriminate  a,-  and  Qj.  Even  though  the  subclasses  are  not  strictly  defined  in 
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general,  we  will  in  the  following  assume  so  (i.e.  assuming  the  class  densities  and 
the  (sub)sample  spaces  to  be  known)  in  order  to  improve  the  understanding  of 
the  strategy. 

Let  Pb(y !<*«)  denote  the  “density”  given  the  subclass  when  projected  into  the 
vector  b.  As  mentioned,  the  point  xi;-,  which  is  the  equal  projected 
cummulative  probability  point  (on  iij),  has  to  be  determined  first  (a  solution  is 
assumed  to  exist).  In  other  words: 


r  OO  rbx') 

P(oti)pb(y\aij)  dy  =  /  P{aj)ph{y\oci)  dy  =  P{j 

Jh'Xi,  J-  oo 


(4.28) 


where  b  = 


(Pr±j) 
\\Pi-Pj\ V 


P(ai)  =  P(a:). 


This  is  illustrated  in  figure  4.7  for  the  situation  where 


Figure  4.7:  Determining  x,j  using  the  projected  cummulative  distribution. 


Now,  let  us  define  bfe  as 
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,  _  fa  ~  t*k) 
k  l|x<i  -/*t||  ' 


If 


k  ^  i,j 


(4.29) 


then  a,  and  aj  is  assumed  not  to  need  separation.  In  the  Mahalanobis  distance 
approach,  this  corresponds  to  the  situation  for  which  the  Mahalanobis  distance 
between  /*fc  and  x,y  is  less  than  the  Mahalanobis  distance  between  nt  (or  fiA 
and  x,j. 


The  strategy  is  implemented  in  the  following  algorithm: 


FOR  i  1  TO  n  —  nc  DO 
BEGIN 

FOR  j  :=  /(f)  +  1  TO  n  DO 
BEGIN 


<Compute  the  point  x,_,  of  £,;  (if  it  exists)  according  to  eq.  4.28>; 
IF  <  x,j  exists>  THEN  b:=TRUE  ELSE  b:=FALSE; 
k  1; 

WHILE  ((ifc  <  n)  AND  (b))  DO 
BEGIN 

IF  ((*  <>  f)  AND  (k  <>  j ))  THEN 
BEGIN 


IF  <eq.  4.29  holds>  THEN  b:=FALSE; 

END; 
k  :=  k  +  1; 

END; 

IF  (b)  THEN  <Compute  a  hyperplane  separating  a,  and  a >; 

END; 

END; 
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The  probabilities  used  are  of  course  estimated.  Let  yj1  =  {y,-,  :  ■  ■  •  :  y,N  }  denote 
the  data  set  representing  a,  (i.e.  A*,-)  projected  into  the  vector  b,  and  also 
assume  y,-,  <  y,2  <  •  •  ■  <  y,^  .  Then  a  reasonable  estimator  for  the  projected 
cummulative  distribution  is 

h(y  >  y'k)  =  ^  (4.30) 

where  l  is  the  largest  integer  for  which  y<,  <  y'. 

This  algorithm  is  quite  robust  and  it  has  been  found  to  work  pretty  well. 
However,  it  is  one  situation  where  it  will  not  work.  Assume  the  subclasses  to  be 
well  separated.  This  corresponds  to  situations  for  which  y^  <  yj, ,  where 
ViN  €  yj5,  y21  €  3^*,  and  b  =  Then  we  will  find  that  F|;  =  0,  and  we 

only  know  that  xtJ  should  lie  between  x,w  and  x21 .  The  solution  to  this 
problem  is  to  use  estimates  of  the  density  of  the  subclasses  based  on  the  k 
nearest  neighbour  method  in  stead  of  the  projected  cummulative  distribution. 
This  can  be  done  because  for  any  point  x  on  £ij  for  which  b'x  <  min{y1Jt}  (or 
b*x  >  max{y,k}),  the  estimate  is  changing  monotonically  with  b'x. 

4.3.4.2  Generating  a  suitable  linear  classifier 

When  we  have  decided  which  pairs  of  subclasses  are  to  be  discriminated  by  a 
hyperplane,  these  hyperplanes  have  to  be  constructed.  During  the  years,  several 
algorithms  have  been  developed  for  this  purpose  [8]. 


Clark  and  Gonzalez  [4]  have  recently  presented  an  interesting  linear  classifier  for 
the  two-class  problem.  Their  approach  minimizes  the  number  of 
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misclassifications.  Thus,  it  minimizes  the  estimate  of  the  error  rate  (based  on 
the  training  samples)  if  P(uii)  =  jvj+jVj  >  *  =  However,  it  would  have  been 
preferable  to  obtain  a  classifier  minimizing  the  error  rate  regardless  of  the 
a  priori  probability. 

Gallant  [14]  has  proposed  an  algorithm  called  the  pocket  algorithm  (PA).  This 
is  based  on  an  error  correction  procedure,  and  it  is  minimizing  the  number  of 
misclassifications  when  the  number  of  samples  reaches  infinity.  The  pocket 
algorithm  derived  its  name  from  the  process  of  saving  “in  your  pocket”  the 
weight  vector  with  the  longest  consecutive  run  of  correct  classification  trials  in 
the  error  correction  procedure.  The  classes  u> ,•  and  u>,  are  to  be  separated,  and 
the  algorithm  works  as  follows: 


a  :=  0;  (*  a  is  the  current  weight  vector.  *) 

run_  of_  corr_  class  :=  0;  run_  of_  corr_  class,  p  :=  0;  it  :=  0; 

REPEAT 

<Randomly  pick  a  training  sample  x*  >;  it  :=  it  +  1; 

IF  <correctly  classified>  THEN 
BEGIN 

run.  of.  corr.  class  :=  run.  of.  corr.  class  -I-  1; 

IF  (run.  of.  corr.  class  >  run.  of.  corr.  class,  p)  THEN 
BEGIN 

a,,  :=  a; 

run.  of.  corr.  class,  p  :=  run.  of.  corr.  class; 

END; 

END 

ELSE  BEGIN 

IF  (uj(xk)  =  <^i)  THEN  a  :=  a  +  xjt 
ELSE  a  :=  a  —  x*; 

END; 

UNTIL  (it  >  itmax); 


Unfortunately  the  (design  set  based)  error  rate  estimate  is  generally  not 
decreasing  monotonically  as  the  number  of  consecutive  correctly  classified 


samples  is  increasing.  Moreover,  it  does  not  exist  any  known  bound  on  the 
number  of  iterations  needed  for  producing  sufficiently  good  weight  vectors. 
Furthermore,  the  error  rate  estimate  is  more  preferable  as  minimizing  criterion 
than  consecutive  the  number  of  correct  classification  trials.  Therefore  we  modify 
the  pocket  algorithm  slightly. 

It  is  known  that  only  a  finite  number  of  different  weight  vectors  can  be  reached 
using  the  error  correction  learning  [14].  Therefore,  by  investigating  the  error 
rate  estimate  of  the  design  set  {or  generally  a  cost  function)  instead  of  the 
number  of  consecutive  correctly  classified  samples  we  are  able  to  minimize  the 
design  set  based  error  rate  estimate.  Moreover,  it  is  easy  to  see  from  the 
algorithm  to  be  presented  that  the  design  set  based  error  rate  estimate  is 
decreasing  monotonically  with  increasing  number  of  iterations.  Thus,  the 
algorithm  will  always  return  the  best  weight  vector  (the  weight  vector  giving 
the  lowest  design  set  based  error  rate  estimate)  of  those  which  have  been 
evaluated.  This  modified  pocket  algorithm  (MPA)  is  as  follows: 

a  :=  0;  (*  a  is  the  current  weight  vector.  *) 

A P  :=  A  :=  1;  it  :=  0; 

REPEAT 

<Randomly  pick  a  training  sample  Xk  >;  it  :=  it  4-  1; 

IF  <misclassified>  THEN 
BEGIN 

IF  (a(xfc)  =  a,)  THEN  a  a  -f-  Xk 
ELSE  a  :=  a  —  x*; 

if  (A  <  A„)  then 

BEGIN 


ap:=a; 

END; 

END; 

UNTIL  ((it  >  it  max)  OH  ( Ap  —  0)); 


The  price  to  pay  for  this  improvement  is  increased  computation  time. 


As  an  example,  let  us  compare  the  PA  and  MPA  using  the  data  set  shown  in 
figure  4.8.  In  this  example  u»i  is  represented  by  150  samples  taken  from  a 

./  h  ol\ .  ..  ... 


N  0, 


distribution,  and  is  represented  by  150  samples  taken  from 


1  ^  V  L  J  '  /  ^S^r^U^OQ'  k*  ^Sure  ^-9  error  ra^es  (based  on  the  training 

samples)  are  plotted  as  a  fun" '  on  of  number  of  iterations  in  order  to  show  how 
the  classifiers  converge. 


-? . S8S 


Figure  4.8:  The  samples  representing  the  classes. 
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Figure  4.9:  The  design  set  based  error  rate  estimate  for  the  PA  (solid  line)  and 
MPA  (dashed  line)  based  on  the  data  set  in  figure  4.8  as  a  function 
of  the  number  of  iterations. 
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As  we  see  from  the  figure,  the  error  rate  of  the  MPA  is  decreasing 
monotonically,  and  moreover  the  MPA  performs  better  than  the  PA. 

4. 3.4.3  Computing  the  piecewise  linear  discriminant  function 

In  4. 3.4.1  and  4. 3. 4. 2  we  have  described  algorithms  for  finding  the  pairs  of 
subclasses  which  have  to  be  separated  by  hyperplanes,  and  for  generating  the 
actual  hyperplanes.  Having  obtained  this  knowledge,  a  piecewise  linear  classifier 
is  to  be  generated.  Let  us  assume  that  m  pairs  of  subclasses  should  be 
separated,  and  that  the  m  hyperplanes  have  been  generated.  Furthermore,  let 
a'kJk  define  the  hyperplane  separating  the  subclasses  ct{k  and  ctjk  (1  <  u  <  n 
and  1  <  jk  <  n ).  Since  a \kJk  is  separating  these  two  subclasses,  it  must  be  a 
solution  of  the  equation  a,k  —  aJk  =  a'lkJk .  Now,  we  want  to  find  the  weight 
vectors  a1?  •  •  -  an  (one  for  each  subclass)  by  solving  the  following  set  of  equations 


- 

=  aU 

clt'2 

1 

•  ••  p 

M 

=  aU 

(4.31) 

*<m 

3Jm 

=  a' 

‘mjm 

Unfortunately,  we  are  not  guaranteed  a  unique  solution  of  4.31.  In  fact,  we  may 
also  either  have  an  infinite  number  of  solutions,  or  a  solution  does  not  have  to 
exist  at  all.  In  the  first  situation,  we  are  interested  in  finding  one  of  the  possible 
solutions,  and  in  the  latter  one,  the  best  thing  to  do  is  to  use  the  solution 
minimizing  the  mean  squared  error.  It  may  be  shown  that  an  “optimal” 
solution  (with  respect  to  the  least  square  minimum  norm)  can  be  obtained  in  all 
these  situations  by  using  the  singular  value  decomposition  method  (SVD)  [27], 
For  details,  see  appendix  C  where  we  have  given  a  brief  presentation  of  the 
method  applied  to  our  problem. 

However,  difficulties  will  occur  if  at  least  one  subclass  is  not  involved  in  the 
hyperplane  separation.  The  solution  to  this  problem  is  simply  to  detect  all  these 
subclasses,  mark  them  as  passive,  and  not  include  them  in  4.10. 
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4.3.5  Comparison  of  the  performances  of  the  different  classifiers 

In  4.1.4  it  was  argued  that  the  only  way  of  comparing  the  two  suggested 
splitting  strategies  is  through  Monte  Carlo  simulations.  The  same  argument  is 
valid  for  the  evaluation  of  the  performance  of  the  different  classifiers.  We  have 
made  simulations  using  the  same  distributions  as  in  4.1.4.  Moreover,  all 
classifiers  derived  earlier  in  4.3  have  been  tested,  in  other  words 


-  The  nearest  submean  classifier  (NSMC), 

-  The  mean  squared  error  classifier  (MSE), 

-  4  classifiers  based  on  the  weighed  mean  squared  error  approach  (W-MSE), 

-  The  hyperplane  based  classifier  (HPC). 


Unfortunately  it  is  impossible  to  present  all  results  from  the  simulations. 
Therefore  examples  will  be  used  to  illustrate  the  main  results. 

We  found  the  ranking  of  the  classifiers  (based  on  the  performance)  to  be  more 
or  less  independent  of  the  number  of  samples  in  the  design  set. 

First,  the  W-MSE  classifier  was  evaluated.  We  have  tested  4  different  values  of 
the  power  ( p  -  see  4.22)  in  order  to  find  how  much  the  contribution  to  the 
squared  error  from  each  subclass  should  be  weighed.  We  have  made  simulations 
using  p  =  0,i,  1,3  and  we  found  that  the  best  classifier  is  produced  for  p  —  0. 
Moreover,  we  found  the  performance  to  decrease  monotonies lly  with  p.  This 
fact  indicates  that  the  contribution  from  the  different  subclasses  for  which 
w(q,)  7^  w(afc)  should  be  almost  equally  weighted  when  determining  a*.  In 
figure  4.10  these  results  are  shown.  The  samples  representing  and  u>2  are 
drawn  from  Gaussian  distributions  ( d  =  0.5,  7  =  60°)  in  accordance  to  4.1.4. 
Moreover,  25  samples  are  representing  each  class. 


Next,  the  best  W-MSE  classifier  (p  =  0)  is  tested  against  the  MSE,  the  NSMC 
and  the  HPC.  As  expected,  we  found  the  W-MSE  classifier  to  perform  better 


Figure  4.10:  Mean  performance  of  the  W-MSE  classifier  as  a  function  of  number 
of  subclasses.  25  samples  are  representing  each  class  and  they  are 
drawn  from  Gaussian  distributions  ( d  =  0.5  and  7  =  60°). 

than  the  MSE  classifier.  This  is  reasonable  because?  all  unnecessary 
contributions  to  the  squared  error  are  removed.  However,  the  mean  squared 
error  based  approaches  do  not  perform  as  good  as  the  two  others.  The 
simulations  show  the  NSMC  (section  4.3.1)  to  perform  quite  well,  and  it  adapts 
quite  fast  to  the  data  set.  The  disadvantage  is  that  the  classifier  is  based  only 
on  the  the  mean  of  the  subclasses.  Hence,  the  classifier  may  not  be  well 
adjusted  to  the  data  set  for  a  given  number  of  subclasses.  This  fact  is  easily 
illustrated  by  considering  the  situation  where  the  direction  of  the  vector 
perpendicular  of  a  hyperplane  separating  two  (sub)classes  (e.g.  a;  and  aj) 
satisfactorily,  differs  significantly  from  the  direction  of  the  vector  /z,  — 

The  HPC  (section  4.3.4)  is  found  to  do  a  good  job.  It  adapts  nicely  to  the  data 
set,  often  using  only  a  few  subclasses.  However,  in  situations  where  the  weight 
vectors  cannot  be  found  exactly,  we  are  not  guaranteed  a  well  performing 
classifier  (for  details  see  section  4. 3. 4. 3).  This  effect  is  seen  in  the  three  class 
tests  where  the  NSMC  adapts  somewhat  faster  to  the  data  set  than  the  HPC. 
However,  the  HPC  is  superior  to  the  MSE  based  classifiers  in  this  case  too. 
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In  figures  4.11-4.13  these  findings  are  illustrated.  In  figure  4.11  the  classes  are 
Gaussian  distributed  (d  =  1.0,  j  =  60°),  and  in  figure  4.12  the  classes  are 
Gaussian/banana  distributed  using  d  =  1.5.  Finally  in  figure  4.13  the  results  of 
the  three-class  problem  using  d  =  1.0  is  shown.  All  plots  are  based  on  using  150 
samples  for  representing  each  class  in  a  given  replication. 


2.0  S.O  1.0  S.O  8.0  1.0  0.0 


**». 

Figure  4.11:  Mean  performance  as  a  function  of  number  of  subclasses  using  Gaus¬ 
sian  distributed  samples.  150  samples  are  representing  each  class  in 
a  given  replication,  d  =  1.0  and  7  =  60°. 
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Figure  4.12:  Mean  performance  as  a  function  of  number  of  subclasses  using  Gaus¬ 
sian  and  banana  distributed  samples.  150  samples  are  representing 
each  class  in  a  given  replication  and  d  =  1.5. 


Figure  4.13:  Mean  performance  as  a  function  of  number  of  subclasses  in  the  three- 
class  problem.  150  samples  are  representing  each  class  in  a  given 
replication. 
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4.4  Determining  the  final  classifier 

According  to  chapter  3,  the  splitting  process  is  at  some  time  to  be  terminated. 
For  example,  one  may  stop  the  splitting  when  the  performance  of  the  classifier 
is  sufficiently  good.  However,  there  is  one  great  problem:  The  performance  does 
not  generally  increase  monotonically  as  the  number  of  subclasses  is  increasing. 
Therefore  we  have  to  continue  the  splitting  until  either  the  maximum  permitted 
number  of  subclasses  is  reached,  or  we  are  convinced  that  the  performance  will 
not  be  significantly  improved  by  further  splitting. 

4.4.1  Termination  of  the  splitting 

First,  the  maximum  number  of  subclasses,  and  hence  the  maximum  number  of 
discrimination  functions  of  the  classifier  is  to  be  determined.  Afterwards,  the 
splitting  proceeds  until 


a)  the  maximum  number  of  subclasses  is  reached, 

b)  no  subclasses  can  be  split  according  to  a  certain  criterion, 

c)  the  greatest  contribution  from  a  subclass  to  the  performance  is  sufficiently 
small. 


When  the  number  of  samples  in  a  given  subclass  is  small,  it  is  difficult  to  do  any 
reliable  splitting.  Therefore,  the  second  criterion  is  defined.  The  third  criterion 
terminates  the  splitting  when  it  is  reasonable  to  believe  that  little  reduction  in 
the  design  set  based  error  rate  estimate  is  gained  by  further  splitting. 

4.4.2  Weighing  of  Pe  and  the  number  of  subclasses 

In  this  proposal,  we  will  repeat  the  splitting  until  we  have  at  most  n  <  nmax 
subclasses.  In  each  iteration  a  criterion  combining  the  design  set  based  error 
rate  and  the  number  of  subclasses,  /(Pe(i),i)  is  computed.  Pe(i)  denotes  the 
design  set  based  error  rate  of  the  classifier  used  in  iterat  ion  i  —  c.  The  classifier 
using  k  subclasses, 
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f(Pe(k),k)  =  mjn{f(Pe(i),i)}  i  =  c,---,n  (4.32) 

is  assumed  to  be  the  best  classifier  to  use.  Now,  the  problem  is  how  to  choose  /. 
One  reasonable  strategy  is  obtained  by  adding  Pe  and  a  contribution  involving 
the  number  of  subclasses  together.  A  possible  criterion  may  be  as  follows: 

/(A(i)>0  =  (A(0)'  +  (ir—V"  .  o  <  1  <  1  (4.33) 

The  disadvantage  of  4.33  is  its  sensitivity  of  nmax.  Thus,  one  is  to  be  careful 
when  using  it. 

Another  strategy  is  to  use 


/(A(i)>  >')  =  i/(‘  -  C  +  1)  P,( >')  ,  ?  >  1  ■ 


(4.34) 


It  is  also  easy  to  see  from  4.34  that  the  criterion  becomes  more  independent  of 
the  number  of  subclasses  when  q  is  increasing.  However,  q  should  not  be  too 
small.  Then  this  strategy  will  obviously  weight  Pe  too  little  and  the  number  of 
subclasses  too  much.  For  example,  Pe(c)  =  1.0  will  give  almost  the  same 
criterion  value  as  Pe(c  +  9)  =  0.1.  Eq  4.34  seems  to  be  a  reasonable  criterion  to 
use.  We  also  see,  as  a  special  case,  that  we  will  choose  the  classifier  having  the 
smallest  design  set  based  error  rate  if  q  is  large.  However  it  is  important  to 
notice  that  Pe  is  not  a  good  error  rate  estimate.  Therefore,  it  is  reasonable  to 
assume  that  two  classifiers,  for  which  Pe(i)  ~  Pe(j)  and  Pe(i)  >  P .(j),  i  <  h  do 
have  almost  the  same  error  rate.  Hence  Pe(i)  is  preferable  since  Pe(i)  ~  Pt(j) 
and  less  subclasses  (and  thus  weight  vectors)  are  involved.  This  leads  us  over  to 
another  strategy. 
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4.4.3  Determining  a  sufficiently  good  classifier 

We  are  using  the  performance  in  the  different  iterations,  in  other  words 
pe(c),  •  •  • ,  Pe(n),  n  <  nmai.  n  is  the  number  of  subclasses  when  the  splitting 
process  terminates.  As  argued  previously,  it  is  reasonable  to  assume  a  classifier 
with  design  set  based  error  rate  close  to  min{Pe(t)}  as  sufficiently  adapted  to 
the  data  set.  Therefore  we  will  choose  the  classifier  involving  k  subclasses  where 
k  is  the  lowest  number  of  subclasses  for  which 

Pe(k)  <  max  1 7  mjn  {£(*)}  ,  Pm.nj  (4.35) 

where  i  =  c,  •  •  •  ,n,  Pmin  >  0  and  7  >  1.  As  a  special  case  we  see  that  4.35 
equals  4.34  if  7  =  1,  Pmin  =  0  and  q  =  00. 


5  EVALUATION  OF  THE  CLASSIFIERS 

In  the  previous  chapter  a  new  strategy  for  generating  a  multiclass  piecewise 
linear  classifier  was  developed.  The  evaluation  of  the  classifiers  was  only  based 
on  the  adaptability  to  the  data  set  for  a  given  number  of  discriminant  functions 
(subclasses).  The  error  rate,  computational  speed,  memory  requirement,  etc  was 
not  considered.  However,  all  these  topics  are  necessary  to  study,  and  we  also 
have  to  compare  the  classifiers  derived  in  chapter  4  with  other  classifiers  in 
order  to  evaluate  their  performance.  Therefore,  in  this  chapter  we  will  compare 
the  classifiers  using  both  real  and  synthetic  data  sets.  In  the  tests  involving 
synthetic  data,  Monte  Carlo  simulation  is  used  for  estimating  the  various 
evaluation  criteria.  The  following  10  classifiers  are  evaluated: 
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a)  The  classifiers  derived  in  4.31  -  4.34  (NSMC,  MSE,  W-MSE,  HPC) 

b)  Bayes  classifier  with  known  distribution  (wherever  it  is  known!!)  (B) 

c)  Bayes  classifier  assuming  Gaussian  distributions  (B-G) 

d)  Bayes  classifier  with  probability  densities  estimated  with  the  k-N-N 
method  (B-kNN) 

e)  The  rarest  neighbour  rule  (NNR) 

f)  The  tree  classifier  of  Mizoguchi  et  al  [26]  (TC) 

g)  The  seniority  logic  committee  machine  of  Lee  and  Richard  [24]  (SLCM) 


As  seen,  a  wide  range  of  classifiers  are  chosen  including  a  fast,  easy  computable 
and  often  used  classifier  (B-G),  reliable  and  complex  classifiers  (B-kNN  and 
NNR)  (fast  algorithms  for  finding  the  k  nearest  neighbours  are  also  available, 
e.g.  [22])  as  well  as  “competing”  piecewise  linear  classifiers  (TC  and  SLCM). 
Thus  other  interesting  classifiers  such  as  various  tree  classifiers  [33,  37,  38], 
classifiers  assuming  mixed  Gaussian  distributions  [18,  35]  or  classifiers  based  on 
vector  quantization  [36]  are  excluded  for  several  reasons.  First  of  all,  the 
number  of  classifiers  used  in  the  evaluation  has  to  be  restricted.  Furthermore, 
some  of  the  interesting  classifiers  have  also  been  actualized  after  the  initiation  of 
our  study. 

The  evaluation  criteria  have  been  chosen  as 


a)  Pe  -  the  error  rate  of  a  classifier  using  the  design  set  ( PeD )  or  the  test  set 
(Per). 

b)  P(e\u>)  -  the  conditional  error  rates. 

c)  The  number  of  discriminant  functions  (used  for  NSMC,  MSE,  W-MSE, 
HPC,  TC,  SLCM) 

d)  The  average  number  of  computed  discriminant  functions  (used  for  NSMC, 
MSE,  W-MSE,  HPC,  TC,  SLCM) 
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The  reason  for  using  the  error  rate  is  obvious.  Moreover,  it  is  also  of  interest  to 
study  the  conditional  error  rate.  Then  we  are  able  to  see  how  well  each  class  is 
being  classified  (compared  to  the  optimal  classifier  wherever  it  may  be  found). 
The  two  last  criterions  are  for  evaluating  the  classifiers  with  respect  to 
computational  speed  and  memory  requirement  of  the  different  piecewise  linear 
classifiers. 

In  the  Monte  Carlo  simulations  we  have  computed  -  for  each  criterion  -  both 
the  mean  criterion  value  and  its  standard  deviation. 

The  piecewise  linear  classifiers  derived  in  [24]  and  [26]  are  designed  for  two  class 
problems  only.  Therefore,  the  tests  used  in  this  chapter  are  mainly  based  on  two 
class  problems.  Three  synthetic  data  sets  (two  class  problems)  used  in  the  tests, 
consist  of  samples  drawn  from  Gaussian  distributions,  Gaussian/banana 
distributions  and  banana/banana  distributions.  Two  data  sets  -  both  two  class 
problems  -  involving  real  data  are  also  used.  The  first  data  set  is  derived  from 
geophysical  events  [9]  and  the  second  data  set  is  derived  from  the  silhouettes  of 
two  different  cars.  Finally,  the  classifiers  derived  in  this  thesis,  the  Bayes 
classifier(s)  and  the  nearest  neighbour  rule  are  tested  on  the  synthetic  three 
class  problem  described  in  4.1.4  in  order  to  demonstrate  the  classifiers 
multiclass  properties. 

We  use  the  algorithm  described  in  4.4.3  (7  =  1.2,  Pmin  =  0.02)  for  determining 
the  (final)  classifier.  As  we  remember,  this  strategy  is  only  based  on  the  (design 
set  based)  error  rate  estimate.  Thus,  we  are  able  to  evaluate  both  the  number  of 
discriminant  functions  needed  to  obtain  a  well  adapted  classifier  as  well  as  the 
error  rate. 

For  each  Monte  Carlo  experiment  (each  set  of  distributions),  we  have  made 
tests  using  25  and  150  samples  for  representing  each  class  (in  a  replication). 
Therefore  we  are  able  to  evaluate  the  classifiers  both  for  small  data  sets  and  for 
moderate  to  large  data  sets.  Equal  a  priori  probabilities  have  been  assumed  in 
all  experiments.  Moreover,  500  replications  are  used  in  the  simulations  involving 
150  samples  from  each  class,  and  1000  replications  when  only  25  samples  are 
used  in  each  class. 
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In  the  experiment  using  Gaussian  distributions,  the  samples  representing  wj  is 

9  0 . 

distribution.  The  samples  from  u>2  is  drawn  from 


taken  from  a  N 


a  N 


0 


1 


distribution.  Thus  we  will  have  some  overlap  between 

_  J  Lu.yJ/ 

the  classes.  This  is  an  interesting  situation  for  investigating  how  well  a  classifier 
is  able  to  discriminate  the  classes  without  being  too  adaptable  to  the  design  set. 
The  results  (containing  the  means  and  standard  deviations  of  the  criteria)  from 
the  150-samples  test  are  shown  in  table  5.1. 


The  next  experiment  consists  of  Gaussian  and  banana  distributed  samples.  The 
samples  representing  uj  are  drawn  from  a  banana  distribution  (parameters: 

/*  =  0,  t*o  =  5  and  0  =  180°),  and  the  samples  from  u>2  are  taken  from  a 
standard  Gaussian  distribution  (see  4.1.4).  This  is  an  interesting  situation  both 
because  the  overlap  between  the  classes  is  small  and  because  only  non-linear 
surfaces  are  able  to  discriminate  the  classes  well.  The  results  when  using  150 
samples  for  representing  each  class  in  each  replication  are  shown  in  table  5.2. 


In  the  experiment  presented  in  table  5.3,  all  samples  are  taken  from  banana 
distributions  in  order  to  see  how  the  classifiers  handle  concave  data  sets.  Here, 
only  25  samples  are  representing  each  class  in  each  replication.  The  parameters 
of  the  distribution  of  uq  are:  fx  =  0,  r0  =  5  and  0  =  180°.  The  parameters  of 

the  distribution  of  u>2  are:  p  ^  j  ,  r0  =  5  and  0  =  0°.  In  this  experiment, 

only  classifiers  resulting  in  non-quadratic  discrimination  surfaces  are  able  to 
discriminate  the  classes. 


The  last  synthetic  test  to  be  evaluated  is  the  three  class  problem.  The  results 
from  the  150  samples  experiment  are  to  be  found  in  table  5.4. 


P(e\u\)  P{t  |wa) 


22 


0.1253  0.1323 
0.0205  0.0152 


No  discr  func 

No  comp 

2.92 

2.92 

1.23 

1.23 

2.53 

2.53 

0.72 

0.72 

2.77 

2.77 

0.86 

0.86 

2.49 

2.49 

0.68 

0.68 

Table  5.1:  Results  using  Gaussian  distributed  samples.  The  samples  from  ui\  are 
taken  from  a  N(0,diag[9,l])  distribution,  and  the  samples  from  u>2  are 
drawn  from  a  N([0,4]*,diag[l,9])  distribution.  There  are  150  samples 
in  each  class,  and  500  replications  are  used  in  the  simulations.  Both 
the  mean  and  the  standard  deviation  of  each  criterion  and  classifier 
are  shown. 


Classifier  P(e  jwj) 


NSMC 


P(e  |wj) 


22 

06 


No  discr  func  No  comp 


SLCM 


0.0117 

0.0083 


0.0117 

0.0058 


0.0279 

0.0675 


0.0257 

0.0131 


7.15 

5.27 

2.63 

0.74 

6.13 

2.38 

3.15 

1.42 

Table  5.2:  Results  using  Gaussian/banana  distributed  samples.  The  samples  in 
are  taken  from  a  banana  distribution  with  parameters  n  =0,  r0  = 
5,  ©  =  180°,  and  the  samples  from  u are  drawn  from  a  standard 
Gaussian  distribution.  There  are  150  samples  in  each  class,  and  500 
replications  axe  used  in  the  simulations. 


Classifier  P(e\u>i)  P(e|w2)  Pe 


NSMC 


SLCM  0.0738 
0.0459 


Table  5.3:  Results  using  banana  distributed  samples.  The  samples  in  and  w2 
are  drawn  from  banana  distributions  with  parameters  n  =0,  r0  =  5, 
0  =  180°  and  (jl  =  [—3,  — 5]*,  r0  =  5,  0  =  0°.  There  are  25  samples  in 
each  class,  and  1000  replications  are  used  in  the  simulations. 


Classifier  P(e|u>i) 


NSMC 


^(e|u>2)  P(e|a*) 


W-MSE 


NNR  0.2721 
0.0268 


Table  5.4:  Results  from  the  three  class  problem.  For  details  about  the  distribu¬ 
tions,  please  see  the  text.  There  are  150  samples  in  each  class,  and 
500  replications  are  used  in  the  simulations.  The  two-class  classifiers 
TC  and  the  SLCM  are,  of  courese,  not  included  in  this  test. 


99 


The  first  thing  to  notice  is  that  the  mean  squaxed  error  and  the  weighed  mean 
squared  error  based  classifiers  generally  perform  non-satisfactorily.  The  only 
exception  is  the  first  experiment  (concerning  Gaussian  distributions)  where  they 
work  well  due  to  the  linear  classifier’s  capability  of  discriminating  convex 
distributions. 

However,  the  NSMC  and  the  HPC  perform  satisfactorily,  and  they  work  equally 
well  as  the  B-kNN  and  the  NNR.  The  NNR  perform  better  than  the  NSMC  and 
the  HPC  in  the  experiments  where  there  are  little  overlap  between  the  classes, 
and  the  B-kNN  perform  slightly  better  than  the  NSMC  and  the  HPC  in  all 
experiments.  These  results  show  that  it  seems  possible  to  generate  reliable 
classifiers  which  are  computationally  simple. 

Even  though  the  Gaussian  assumption  in  some  experiments  is  far  from  fulfilled, 
the  Bayes  classifier  with  Gaussian  assumptions  works  well  in  all  experiments 
except  for  the  one  concerning  banana/banana  distributions.  Furthermore, 
according  to  section  2.3,  it  is  almost  as  fast  to  compute  as  our  classifiers. 

Compared  with  the  TC  and  the  SLCM,  our  classifiers  perform  better.  When  it 
is  some  overlap  between  the  classes,  such  as  in  the  experiments  considering 
samples  taken  from  Gaussian  distributions,  the  TC  and  the  SLCM  require 
relatively  large  amount  of  memory  and  computation  time  (relative  to  the 
NSMC  and  the  HPC)  too. 

The  first  data  set  involving  real  data,  contains  311  seismic  events  (113  nuclear 
detonations  and  198  earthquakes)  recorded  at  the  Large  Apparture  Seismic 
Array  (LAS A)  near  Billings,  Montana.  The  duration  of  each  signal  is  60 
seconds.  The  signals  are  sampled  at  20  Hz.  Most  of  the  signal  energy  which  has 
been  found  useful  for  teleseismic  discrimination  is  within  the  frequency  band 
0.3  Hz  to  5.0  Hz.  Therefore,  the  signals  are  resampled  with  10  Hz  after  a 
median  filtering  of  the  original  time  series.  Moreover,  the  signals  are  also  peak 
to  peak  scaled  and  normalized  to  zero  mean. 

The  total  set  of  events  is  randomly  divided  into  a  design  set  and  a  test  set.  The 
distribution  of  events  with  respect  to  class  and  subset  is  given  in  table  5.5. 
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Class 

Design  set 

Test  set 

Explosion 

56 

57 

Earthquake 

93 

105 

Table  5.5:  The  number  of  geoseismic  events  (samples)  in  the  design  set  and  test 
set  for  each  class. 

For  each  time  series,  the  autoregressive  (AR)  coefficients  are  computed.  In  other 
words,  we  use  the  coefficients  a,j  of  an  m’th  order  AR  model 

m— 1 

Xk+I  =  ^2  GjXk-]  +  £k 
j*  0 

where  e,  is  stochastic  and  may  be  viewed  as  the  prediction  error.  These 
coefficients  are  determined  by  Burgs  algorithm  which  uses  the  available  time 
serie  for  minimizing  the  error  power  Pm  given  as: 

n— m 

=  E  4 

fc= l 

The  maximum  entropy  power  spectral  estimate  for  real  input  data  is 
S(  f)  - _ Pm&t _ 

where  A t  denotes  the  sampling  interval.  Furthermore,  the  spectral  ratio  is 
defined  as 

cn  .  Sl£y/sU)<V 

O IV  —  .  ■■ ~~  . 

tSS  \  W)  df 

In  order  to  use  dynamic  information,  the  AR  coefficients  and  the  spectral  ratio 
are  computed  from  windows  taken  at  N  regular  intervals.  These  spectral  ratios 
and  the  bodywave  magnitude  Mb  are  then  combined  in  the  following  N  +  1 
dimensional  feature  vector 
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x  =  [SR1(  SR2,  ■  •  • ,  SR/v,  Mb]*  ■ 


We  have  used  3  spectral  ratios  in  the  feature  vector.  Moreover,  a  two 
dimensional  version  of  the  data  set  is  also  classified  after  applying  the 
Foley-Sammon  transform  for  dimensionality  reduction.  Mor**  details  about  the 
signal  processing  and  the  computation  of  the  feature  vector  may  be  found  in  [9]. 
The  Foley-Sammon  transformed  data  set  is  plotted  in  figure  5.1. 


Figure  5.1:  Scatter  plot  of  the  geoseismic  events.  Class  1  (o)  is  explosions  and 
class  2  (x)  is  earthquakes. 

All  the  different  classifiers  have  been  tested  on  these  data  sets,  and  the  results 
usiDg  the  four  dimensional  data  set  are  shown  in  table  5.6.  The  results  using 
the  two  dimensional  data  set  are  given  in  table  5.7.  In  the  experiments,  the 
a  priori  probability  P{u\)  =  ^  is  assumed. 

These  data  sets  are  useful  for  the  evaluation  of  the  classifiers.  As  can  be  seen 
from  the  plot  in  figure  5.1,  they  seem  to  be  rather  difficult  data  sets  to  classify 
due  to  the  overlap  between  the  classes.  Application  of  the  strategies  developed 
in  this  thesis  shows  (with  one  exception)  that  a  linear  classifier  is  sufficient  for 
the  discrimination.  Intuitively,  from  the  figure,  this  is  a  reasonable  decision  to 
make. 
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Classifier 

P(e|wi) 

P(e\u2) 

P<o 

P 

*  CT 

No  discr  func 

No  comp 

NSMC 

0.3333 

0.0952 

0.1611 

0.1847 

2 

2.00 

MSE 

0.3509 

0.0952 

0.1208 

0.1913 

2 

2.00 

W-MSE 

0.2982 

0.1238 

0.1342 

0.1894 

2 

2.00 

HPC 

0.2281 

0.1905 

0.1678 

0.2046 

2 

2.00 

B-G 

0.3333 

0.1238 

0.1912 

0.2026 

- 

- 

B-kNN 

0.2982 

0.1238 

— 

0.1894 

- 

- 

NNR 

0.3158 

0.1619 

— 

0.2197 

- 

- 

TC 

0.2982 

0.2476 

— 

0.2666 

21 

4.15 

SLCM 

0.4035 

0.2857 

— 

0.3300 

37 

19.85 

Table  5.6:  Results  using  data  set  containing  seismic  events.  The  data  set  with 
four  dimensional  feature  vectors  are  used. 


Classifier 

P(e  M 

P(e|w2) 

P*D 

P 

*  CT 

No  discr  func 

No  comp 

NSMC 

0.2807 

0.1429 

0.1342 

0.1947 

2 

2.00 

MSE 

0.2982 

0.0571 

0.1275 

0.1478 

4 

4.00 

W-MSE 

0.2456 

0.1333 

0.1409 

0.1755 

2 

2.00 

HPC 

0.3158 

0.0606 

0.1208 

0.1544 

2 

2.00 

B-G 

0.2807 

0.1048 

0.1544 

0.1709 

- 

- 

B-kNN 

0.2632 

0.1905 

— 

0.2178 

- 

- 

NNR 

0.2982 

0.1810 

— 

0.2250 

- 

- 

TC 

0.2456 

■0.2286 

— 

0.2350 

35 

4.60 

SLCM 

0.2456 

0.2190 

— 

0.2290 

38 

19.26 

Table  5.7:  Results  using  data  set  containing  seismic  events.  The  data  set  with  two 
dimensional  (Foley  Sammon  transformed)  feature  vectors  are  used. 
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We  also  see  that  the  results  in  general  are  better  by  classifying  the  two 
dimensional  data  set  than  by  classifying  the  four  dimensional  data  set.  This 
effect  probably  results  from  the  increased  number  of  parameters  which  have  to 
be  determined.  However,  there  is  no  (evaluated)  classifier  that  performs 
significantly  better  in  the  four  dimensional  case  than  those  developed  in  the  last 
chapter  (NSMC,  MSE,  W-MSE,  HPC).  Furthermore,  compared  to  the  TC  and 
the  SLCM,  our  classifier  are  much  better. 

In  the  two  dimensional  experiment,  all  our  classifiers  except  the  NSMC  give 
very  low  error  rates.  In  fact,  only  the  the  B-G  has  comparable  results  (relative 
to  our  classifiers)  in  this  experiment.  Due  to  the  heavy  overlap  between  the 
classes,  the  NNR  and  the  B-kNN  provide  a  surprisingly  high  error  rate.  We  also 
notice  that  our  classifiers  requires  much  less  memory  than  the  “competing” 
piecewise  linear  classifiers,  and  that  they  are  significantly  faster  than  the  SLCM. 

The  next  and  last  example  to  be  given  is  the  discrimination  of  two  different  cars 
(Nissan  Sunny  and  Nissan  Prairie).  An  image  sequence  of  these  cars  has  been 
recorded  with  a  CCD-TV  camera.  The  sequence  is  digitized  using  the 
Teragon  4000  Image  Processing  computer  of  NDREs  image  processing  and 
pattern  recognition  group.  200  frames  are  recorded  for  each  car.  Then  each 
frame  is  segmented  in  a  very  simple  way.  First  the  absolute  difference  image  of 
the  frame  and  a  reference  frame  is  thresholded.  A  global  entropy  based 
thresholding  procedure  is  used  for  this  purpose  [21  j.  Next,  noise  is  removed 
from  the  binary  image  by  a  3  X  3  median  filter.  Finally  the  segment(s)  are  filled 
by  use  of  a  logical  filter.  This  procedure  gives  a  relatively  good  vehicle 
silhouette  and  only  a  few  small  noise  segments.  Therefore,  since  each  frame 
contains  one  vehicle  only,  the  largest  connected  segment  in  each  frame  is  treated 
as  a  vehicle  silhouette.  In  figure  5.2,  the  segmentation  result  for  frame  no  1  is 
shown.  Figure  5.2a  shows  the  the  original  TV-image,  figure  5.2b  the  difference 
image,  figure  5.2c  the  thresholded  image,  and  finally  in  figure  5. 2d  the  median 
and  logical  filtered  image  is  shown. 

For  each  segment  several  moments  are  computed.  The  (p  +  <?)'th  central 
moment  ,  Mpq,  is  defined  as 


Figure  5.2:  Segmentation  result  for  frame  no  1  in  the  Sunny /Prairie  example. 


Mvo  =  5Z  (x  -  x)p{y  -  y)q 
(x,v)es 

where  p  >  0,  q  >  0  and  5  denotes  the  set  of  pixels  which  defines  the  segment. 
Moreover,  (x,  y)  denotes  the  centroid  of  the  segment.  We  have  computed  the 
moments  for  which  p  +  q  <  3  only.  It  is  easily  seen  that  Mpq  is  neither  rotation 
invariant  nor  scale  invariant.  Scale  invariance  is  obtained  by  sim'ly  dividing 
Mpq  by  (A/oo)^p+9+2.  Furthermore,  rotation  invariance  is  obtained  by  rotating 
the  coordinate  system  an  angle  a.  Here,  a  denotes  the  angle  between  the 
inertial  axis  of  the  segment  and  the  x-axis.  For  details,  see  [31].  Now,  let  Spq 
denote  a  scale,  rotation  and  translation  invariant  moment.  The  moments  S2 o, 
Sq2>  S21,  S12,  530  and  S03  may  now  be  used  in  the  classification. 

Also  in  this  example,  the  total  data  set  is  randomly  divided  into  a  design  set 
and  a  test  set.  The  distribution  of  samples  with  respect  to  class  and  subset  is 
given  in  table  5.8 
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samples)  in  the  design  set  and  test  set  for  each 

All  feature  combinations  are  tested  using  the  nearest  neighbour  rule  in  order  to 
find  well  performing  feature  candidates.  The  classification  system  IPACS  [10]  is 
used  for  this  task.  The  moments  S2 0,  and  S;2  conjunction  with  the 

Folley-Sammon  transform  [13]  have  been  found  to  be  the  best  feature 
combination  (the  combination  with  the  smallest  error  rate  estimate  using  the 
N-N-R).  In  figure  5.3  the  data  set  is  shown,  and  in  table  5.9  the  results  using 
the  different  classifiers  are  given. 


O  .  005  0.3M 

Figure  5.3:  Scatter  plot  of  the  car  discrimination  problem.  Class  1  (o)  is  the 
Sunny  and  class  2  (x)  is  Prairie. 

All  classifiers,  except  the  NSMC  and  the  SLCM,  managed  to  discriminate  the 
two  cars  satisfactorily.  The  results  obtained  by  using  the  HPC  and  the  W-MSE 
are  encouraging.  In  fact,  these  classifiers  produce  the  best  classification  result 
for  this  particular  data  set. 


1 

Class 

Design  set 

Test  set 

Sunny 

95 

105 

Prairie 

91 

109 

Table  5.8:  The  number  of  cars  ( 
class. 
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Classifier 

P(eM 

P(e|u>2) 

P 

i  ex 

No  discr  func 

No  comp 

NSMC 

0.3238 

0.0734 

0.1708 

0.1986 

5 

5.00 

MSE 

0.1619 

0.1193 

0.1342 

0.1406 

2 

2.00 

W-MSE 

0.0762 

0.1284 

0.0980 

0.1023 

4 

4.00 

HPC 

0.1619 

0.0275 

0.1389 

0.0947 

8 

8.00 

B-G 

0.1238 

0.1009 

0.1129 

0.1124 

— 

— 

B-kNN 

0.2095 

0.0367 

— 

0.1231 

— 

— 

NNR 

0.1429 

0.0826 

— 

0.1128 

— 

— 

TC 

0.1048 

0.1835 

— 

0.1442 

39 

5.08 

SLCM 

0.2381 

0.1376 

— 

0.1879 

52 

26.04 

Table  5.9:  Classification  results  using  the  data  set  containing  features  derived 
from  the  silhouettes  of  two  different  cars. 

To  summarize,  the  majority  of  the  strategies  proposed  in  the  previous  chapter 
produce  good  classification  results  in  all  experiments.  Thus,  we  have  succeeded 
in  designing  fast  and  reliable  classifiers  with  performance  comparable  with  more 
complex  ones. 

However,  we  notice  that  the  MSE-based  classifiers  show  an  unconvenient 
behaviour.  They  both  work  well  only  in  situations  where  a  single  hyperplane  is 
sufficient  to  discriminate  between  t!  ?  classes.  In  other  situations,  at  least  one  of 
them  is  not  useful. 

Moreover,  the  NSMC  and  the  HPC  have  much  better  performance  than  the  TC 
and  the  SLCM,  and  usually  only  a  small  number  of  discriminant  functions  is 
required  for  obtaining  a  reliable  classifier.  Furthermore,  the  HPC  seems  to  be 
slightly  better  than  the  NSMC,  but  in  most  cases  there  is  no  significant 
difference.  However,  in  a  few  situations  the  classifiers  do  not  perform  well.  The 
NSMC  may  become  unreliable  if  the  vector  perpendicular  to  a  suitable 
hyperplane  (separating  a,  and  aj)  differs  significantly  from  the  direction  of 
(/i;  —  fij).  It  is  also  affected  by  scaling  of  the  axes.  Furthermore,  the  linear 
classifier  used  in  the  HPC  occasionally  adapts  too  well  to  the  data  set.  This  may 
produce  an  unreliable  classifier  when  only  a  few  training  samples  are  available. 
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Even  though  the  Gaussian  assumption  often  is  far  from  fulfilled,  the  Bayes 
classifier  with  Gaussian  assumptions  (B-G)  performs  well  in  all  situations  where 
hyperquadratic  surfaces  are  feasible  for  class  separation.  Thus,  in  many  cases 
this  classifier  is  an  alternative  to  the  NSMC  and  the  HPC.  Furthermore,  this 
result  indicates  that  classifiers  based  on  mixtures  of  Gaussian  distributions  also 
will  show  a  good  adaptability.  However,  they  are  most  likely  more  demanding 
concerning  computational  power  than  linear,  quadratic  or  piecewise  linear 
classifiers. 

The  NNR  and  the  B-kNN  perform  somewhat  better  than  our  classifiers  in  all 
examples  except  in  situations  with  large  class  overlap.  Unfortunately,  they  are 
computationally  “heavy”,  and  since  they  need  the  data  set  stored,  their  memory 
requirements  may  cause  problems  in  some  applications. 


6  SUMMARY  AND  CONCLUSION 

In  this  thesis  we  have  developed  a  new  strategy  for  generating  a  piecewise  linear 
classifier.  Our  goal  have  been  to  find  classifier(s)  which  combines  high  reliability 
with  fast  computational  speed.  Generally  speaking,  the  algorithm  is  as  follows: 
First  the  algorithm  attempts  to  discriminate  the  classes  by  using  a  linear 
classifier.  If  this  classifier  shows  poor  performance,  then  the  sample  space  of  one 
of  the  classes  is  split  into  two  disjoint  subsample  spaces.  All  samples  in  a  design 
set  from  a  given  class  lying  in  the  same  (sub)sample  space  are  said  to  belong  to 
the  same  (sub)class.  In  a  practical  situation  the  splitting  is  a  clustering  process 
in  which  the  set  of  samples  representing  a  given  (sub)class  are  separated  into 
two  clusters.  Next,  the  (sub)class  information  is  used  for  generating  a  piecewise 
linear  classifier,  and  its  performance  evaluated.  If  the  performance  is 
unsatisfactorily,  the  sample  space  of  one  of  the  (sub)classes  is  split  and  a  new 
classifier  generated  and  evaluated.  This  process  continues  until  either  a 
maximum  number  of  subclasses  is  reached,  further  splitting  is  impossible,  or  the 
performance  is  expected  not  to  be  significantly  improved  by  further  splitting. 
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Several  problems  had  to  be  solved.  First  we  had  to  develop  an  algorithm  for 
determining  a  suitable  subsample  space  to  split.  Two  different  strategies  have 
been  developed  and  evaluated.  The  first  (and  best!)  one  is  a  hillclimbing 
approach  which  first  evaluates  ail  possible  (sub)sample  spaces  to  split,  and  then 
selects  the  best  one  for  splitting.  In  the  other  approach,  the  sample  space  of  the 
(sub)class  contributing  most  to  the  design  set  based  estimate  of  the  error  rate  is 
split.  Not  surprisingly,  the  hillclimbing  approach  perform  much  better  than  this 
contribution  based  splitting.  The  reason  is  that  the  hillclimbing  approach 
evaluates  all  possible  splits  in  a  given  iteration. 

Several  splitting  algorithms  have  been  developed  and  tested.  We  found  that  the 
best  strategy  is  to  first  split  the  data  set  representing  a  given  (sub)class  in  the 

9 

direction  of  maximum  variance  and  then  maximize  the  distance  from  the  mean 
of  the  data  set  of  the  original  (sub)class  to  the  mean  of  the  nearest  new  subclass 
by  using  an  algorithm  based  on  the  principle  of  evolutionary  search. 

Given  a  set  of  (sub)classes,  four  different  ways  of  generating  a  piecewise  linear 
classifier  have  been  developed  and  studied.  The  simplest  one  is  the  nearest 
sub-mean  classifier  (NSMC).  This  classifier  assigns  a  sample  to  the  class  with 
the  nearest  (sub)mean.  Two  other  strategies,  based  on  the  mean  squared  error 
approach,  have  in  many  situations  been  discarded  due  to  poor  performance  of 
the  corresponding  classifiers.  The  last  strategy  (the  HPC)  is  more  complicated 
to  design.  In  this  approach,  we  first  have  to  find  the  pairs  of  (sub)classes  which 
are  to  be  discriminated  by  a  hyperplane.  Next,  the  hyperplanes  are  generated, 
and  finally  used  for  computation  of  the  piecewise  linear  classifier.  This  classifier 
as  well  as  the  NSMC  have  been  found  to  perform  well  in  most  situations. 


The  evaluation  of  the  classifiers  is  based  on  both  synthetic  and  real  data.  Monte 
Carlo  simulation  techniques  have  been  used  in  the  experiments  involving 
synthetic  data. 


109 


Our  classifiers  have  also  been  compared  with  other  classifiers  such  as  Bayes 
classifier  (wherever  the  probabilities  are  known),  Bayes  classifier  with  estimated 
densities  (based  on  the  k-NN  approach),  Bayes  classifier  with  assumed  Gaussian 
distributions,  the  nearest  neighbour  rule,  as  well  as  with  two  piecewise  linear 
classifiers  designed  for  the  two-class  problem. 

The  HPC  performs  well  in  all  tests.  It  adapts  itself  to  the  data  set  using  only  a 
small  number  of  discriminant  functions.  However,  it  seems  that  the  linear 
classifier  used,  adapts  “too  well”  to  the  design  set  especially  when  only  a  small 
number  of  samples  are  available  for  the  training.  This  may  cause  a  high  error 
rate.  Thus  it  seems  that  the  error  rate  estimate  (based  on  the  design  set)  is  not 
a  sufficiently  good  optimizing  criterion.  However,  to  this  authors  knowledge,  no 
linear  classifier  is  available  which  performs  well  in  all  situations,  but  this 
problem  will  be  a  topic  for  a  future  study.  Also  the  NSMC  shows  a  good 
performance  in  almost  all  tests.  Except  for  the  car  discrimination  example,  it 
manages  to  adapt  itself  to  the  data  set  using  only  a  small  number  of 
aiscriminant  functions.  However,  we  have  seen  that  this  classifier  may  have 
problems  with  the  adaptability  if  the  direction  of  the  vector  perpendicular  to 
the  hyperplane  separating  two  subclasses  (e.g.  a,  and  Oj)  differs  significantly 
from  the  direction  of  /x{  —  fJLy  These  findings  are  contrary  of  the  results 
obtained  for  the  other  two-class  piecewise  linear  classifiers.  These  classifiers 
seem  to  be  quite  complicated  when  there  are  some  overlap  between  the  classes. 
In  some  situations  our  classifiers  even  show  a  better  performance  than  the 
computationally  heavy  nearest  neighbour  rule.  Moreover,  Bayes  classifier  with 
estimated  densities  is  better  in  most  situations,  but  the  difference  is  not  large. 
These  results  encourage  further  studies  of  the  HPC  and  the  NSMC  classifiers. 
Topics  for  such  investigation  may  be  comparisons  with  neural  net  classifiers,  as 
well  as  with  classifiers  based  on  mixtures  of  Gaussian  distributions. 
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A:  A  ROBUST  OUTLIER  DETECTOR 

Commonly  used  outlier  detectors  is  based  on  the  Mahalanobis  distance  between 
a  sample  z  and  the  centroid  /i  (the  expected  value)  of  a  class  w.  The 
Mahalanobis  distance  is  given  as 


£>e(z>m)  =  (z-/i)‘E  l{z-n). 


(A:.l) 
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Now,  z  is  considered  as  an  outlier  if 


Dj>(z,n)  >  d. 


(A:.2) 


In  other  words;  if  z  is  outside  the  di t  ellipsoid,  i.e.  if  there  are  more  than  d 
standard  deviations  between  z  and  /i,  then  z  is  assumed  to  be  an  outlier. 

Usually,  p  and  E  are  unknown.  Therefore,  they  need  to  be  estimated.  Given  a 
data  set  Z  =  {zl5  •  •  •  ,zn},  they  can  be  estimated  the  usual  way,  i.e. 


(A:.3) 


1 

n  -  1 


£(z«  “  /*)(z.  “  /*)‘ 

i=i 


(A:.4) 


However,  A:.3  and  A:.4  are  known  to  be  little  robust.  Therefore,  we  will  try  to 
find  a  more  robust  way  of  detecting  outliers. 

Our  outlier  detector  is  based  on  projecting  of  a  sample  into  a  one  dimensional 
space.  First,  let  us  as  an  alternative  to  A:.l,  project  the  samples  in  Z  into  the 
vector  a  =  Moreover,  let  <r  denote  the  standard  deviation  of  the 

distribution  of  the  projected  sample(s).  Furthermore,  y  is  defined  as  y  =  a'z. 
Then  z  is  considered  as  an  outlier  if 


(a 

a‘Sa 


>d 2 


(A:.5) 


A  robust  estimate  of  a  may  be  easily  obtained.  Fenstad  et  al  [11]  have  studied 
several  standard  deviation  estimators,  and  they  concluded  that  a  quartile  based 
estimator  is  very  robust.  This  result  leads  us  to  a  method  for  one  dimensional 
outlier  detection  (and  rejection)  given  in  [1]  which  has  been  found  to  work  out 
well.  This  algorithm  is  based  on  the  interquartile  difference,  and  we  may  easily 
adapt  it  for  our  d  dimensional  outlier  detection. 


TlS.OO  -12.00  -9.00  -6.00  -5.00  5.00  6.00  9.00  t2.00  lS.00 


Let  z  denote  the  sample  to  be  tested  using  the  data  set  Z  =  {zlt  •  •  ■ ,  zn}. 
Moreover,  let  us  define  a  =  (z  —  n),  yi  =  a'z,  and  the  data  set  y  =  {j/i,  ■  •  • ,  yn}. 
Furthermore,  let  H\  and  H 2  denote  the  lower  and  upper  quartile  of  y 
respectively.  Now,  if  ' 


Hi  -  c(H2  -  Hi)  <  a‘z  <  H2  +  c(H2  -  Hi)  (A:.6) 

then  z  is  not  considered  as  an  outlier. 

In  the  simulations  presented  in  4.2  we  used  c  =  1.5,  and  it  is  shown  that  A:. 6 
manage  to  handle  the  outlier  problem. 


Finally,  let  us  illustrate  our  algorithm  using  100  samples  taken  from  bivariat 
Gaussian  distribution.  The  outlier  rejection  boundary  is  computed  and  the 
result  is  shown  in  figure  A:.l  (c  =  1.5). 


from  a  bivariat  Gaussian  distribution. 
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B:  THE  CONNECTION  BETWEEN  -  p)*(/<2  -  n)  AND  tr{W} 

Let  us  first  define  the  within-class  scatter  matrix  (W),  the  between-class 
scatter  matrix  (B)  and  the  (total)  scatter  matrix  (T)  the  usual  way 

w  =  £  EK  -  #*«)K  -  t*i)* » 

.=1  :=i 

B  =  E^-M)(^-^)‘ 

1=1 

and 

T  =  £>,  -  H)(Z,  -  =  W  +  B 

j=i 

where 

1 

=  7f£2<,  • 

iV«  i=l 

+  N2Ai2)  , 

£  =  (zin  ■  ■  •  zIJVi},  i  =  1,2  ,  Z  =  {zi,  •  •  •  ,z/v}  =  Z\  U  Z2,  Ni  denotes  the 
number  of  samples  in  a  subclass  and  N  is  the  total  number  of  samples 
(N  =  N,  +  N2). 

Now  we  may  state  the  following  theorem: 

Theorem:  Minimizing  the  cost  function  J\  =  (/x,  —  /x)f(/x2  —  n)  is  equivalent 
to  minimizing  the  cost  function  J2  =  tr{W}. 
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Proof: 


It  is  easy  to  be  convinced  that 


min{tr{W}}  =  max{tr{B}}  =  max{tr{T  -  W}} 


Let  us  first  find  an  expression  for  tr{B}. 


tr{B}  =  J2  { [M  (/ii (i)  -  p(0)2]  +  [^2  (Mi)  ~  p(0)2] } 

i=i 

=  J2  K^i(02  +  N2n2(i)2  +  Nn(i)2  -  2  +  N2fi2(i))  n(i) ] 

i=i 

=  ^  [MMi(»)2  +  N2n2(i)2  —  iV/i(t)2]  • 

i=1 


Then,  we  find  an  expression  for  the  inner  product. 


(/ix  - /x)‘(/x2  - /i)  =  Y1  [/*i(i)/*2(0  -  (ni{i)  +  fi2{i))n{i)  +  /i(i)a] 

i=i 

=  E  _  ^0* 1(')  +  MOlW/ii  (0  +  Wjmj  (»))  + 

=  E  [-^.(O2  -  $M02  +  MO2' 

= 


Now,  we  find  that  min{(/*1  -  -  n)}  implies  max{trB}}  which  implies 

min{trW}}  which  in  turn  implies  the  theorem. 
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C:  THE  SINGULAR  VALUE  DECOMPOSITION  METHOD 
USED  FOR  DETERMINING  THE  PIECEWISE  LINEAR 
CLASSIFIER  FROM  A  SET  OF  HYPERPLANES 

Let  us  assume  that  m  pairs  of  subclasses  should  be  separated,  and  that  the  m 
hyperplanes  have  been  generated.  Furthermore,  let  a'kJfc  define  the  hyperplane 
separating  the  subclasses  a,k  and  atjk  (1  <  <  n  and  1  <  it  <  n).  Since  a'kJk  is 

separating  these  two  subclasses,  it  has  to  be  a  solution  of  the  equation 
a«*  —  =  Now,  we  want  to  find  the  weight  vectors  a,,  •  •  •  an  (one  for 

each  subclass)  by  solving  the  following  set  of  equations 


at. 

-  aj. 

= 

a»3 

~  aJ2 

=  a'.Si3 

(C:.l) 

at„ 

“  a  j™ 

=  a' 

•mjm 

As  stated  in  4. 3. 4. 3,  a  solution  of  C:.l  can  be  obtained  by  using  the  singular 
value  decomposition  method  (SVD)  [27].  Transfering  C:.l  to  matrix  form  yields 


V'a  =  a' 


where 


/  a,  \ 

\  a« 


and 


(C:.2) 


a  = 


V  ®*mim  / 
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Moreover,  we  see  that  dim{a}  =  nd  x  1  and  dim{a/}  =  md  x  1.  V'  denotes  a 
md  x  nd  matrix  where  2  elements  differs  from  0  in  a  given  row.  In  fact,  in  row  k 
we  have 


1  ,  /  =  (ih  —  l)d  -f  mod(k  —  1,  d)  +  1 
ujt/  =  -  —  1  ,  l  =  ( jh  —  1  )d  +-  mod(  k  —  1 ,  d)  +  1 
0  ,  otherwise  . 


Here  we  are  assuming  ( h  —  1  )d  +  l  <  k  <  hd ,  h  =  1,  •  •  • ,  m. 

However,  due  to  redundancy,  we  observe  that  m{d  —  1)  of  the  rows  and  n(d  —  1) 
of  the  columns  of  V'  may  be  removed  because  of  redudancy  by  defining 


0  ...  1  ...  0  •••  -1  •••  0  •••  0 


0  •••  0 


0  •••  -1  •••  0 


and  it  is  easy  to  verify  that 


(V'a)  =  ^2  ,  •  •  • »  v^k 
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Now,  let  Q  be  an  m  x  k  orthogonal  matrix,  and  let  P  be  an  n  x  k  orthogonal 
matrix  where  k  =  Rank{V}  <  min{m,n}.  Furthermore,  the  columns  in  Q  and 
P  are  solutions  to  the  eigen  value  problem 

VV‘q  =  A2q 

and 

« 

V‘Vp  =  A2p 


(only  vectors  accosiated  with  an  eigen  value  ^  0  are  considered).  Q  and  P  are 
said  to  consist  of  the  left  and  right  hand  singular  vectors  respectively.  It  may 
now  be  shown  that  V  can  be  decomposed  as 

V  =  QAP‘ 

where  A  =  diag(A!,  •  •  • ,  A*)  is  a  diagonal  matrix  whose  elements  are  the  singular 
values  of  V.  The  generalized  inverse  V+  is  now  defined  as 


V+  =  PA"1Q‘  ,  dim{V+}  =  n  x  m, 


and  moreover,  the  weight  vectors  are  computed  as  as 


m 

a.  =  £  u,ta/  •  (C:-3) 

(=1 


